Saturday, March 10, 2012

Resource Isolation in Condor using cgroups

This is the last in my series on job isolation techniques.  The series has spanned several postings over the last month, so it may help to recap:

  • Part I covered process isolation: preventing processes in one job from interacting with other jobs.  This has been achievable through POSIX mechanisms for a while, but the new PID namespace mechanism provides improved isolation for jobs running as the same user.
  • Part II and Part III discussed file isolation using bind mounts and chroots.  Condor uses bind mounts to remove access to "problematic" directories such as /tmp.  While more complex to set up, chroots allow jobs to run in an environment completely separate from the host and further isolate the job sandbox.
This post will cover resource isolation: preventing jobs from consuming system resources promised to another job.

Condor has always had some crude form of resource isolation.  For example, the worker node could be configured to detect when the processes in a job have more CPU time than walltime (a rough indication that more than one core is being used) or when the sum of each process's virtual memory size exceeds the memory requested for the job.  When Condor detects too many resources are being consumed, it can take an action such as suspending or killing the job.
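The two checks described above can be sketched in a few lines. This is only an illustration of the idea, not Condor's code; the names ProcessSample and over_limits are made up for the example, and the samples are assumed to come from one poll of the job's processes.

```python
from dataclasses import dataclass

@dataclass
class ProcessSample:
    cpu_seconds: float   # user + system CPU time consumed so far
    virtual_bytes: int   # virtual memory size of the process

def over_limits(samples, wall_seconds, requested_bytes):
    """Return the limits a job violates at one poll (hypothetical helper)."""
    violations = []
    # More CPU time than walltime hints that more than one core is in use.
    if sum(p.cpu_seconds for p in samples) > wall_seconds:
        violations.append("cpu")
    # Sum of per-process virtual sizes compared with the memory request.
    if sum(p.virtual_bytes for p in samples) > requested_bytes:
        violations.append("memory")
    return violations
```

Anything a job does between two calls to a check like this goes unnoticed, which is the first weakness of the polling approach.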

This traditional approach is relatively unsatisfactory for a few reasons:
  • Condor periodically polls to view resource consumption.  Any activity between polls is unmonitored.
  • The metrics Condor traditionally monitors are limited to memory and CPU, and the memory metrics are poor quality for complex jobs.  On a modern Linux box, the sum of many processes' virtual memory sizes has little correlation with RAM used and is not particularly meaningful.
  • We can do little with the system besides detect when resource limits have been violated and kill the job.
    • We cannot, for example, simply instruct the kernel to reduce the job's memory or CPU usage.
    • Accordingly, users must ask for peak resource usage, which may be well above average resource usage, decreasing overall throughput.  If the job needs 2GB on average but 4GB for a single second, the user will ask for 4GB; the other 2GB will sit unused most of the time.
In Linux, the oldest form of resource isolation is processor affinity or CPU pinning: a job can be locked to a specific CPU, and all its processes will inherit the affinity.  Because two jobs are locked to separate CPUs, they will never consume each other's CPU resources.  CPU pinning is unsatisfactory for reasons similar to memory: jobs can't utilize otherwise-idle CPUs, decreasing potential system throughput.  The granularity is also poor: you can't evenly fairshare 25 jobs on a machine with 24 cores as each job must be locked to at least one core.  However, it's a step forward - you don't need to kill jobs for using too much CPU - and it has been present in Condor since 7.3.
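For illustration, here is roughly what pinning looks like from Python on Linux, using the standard os.sched_setaffinity call. It is a sketch of the kernel mechanism only, not of how Condor applies it; pin_job_to_cpu is a name invented for the example.

```python
import os

def pin_job_to_cpu(cpu, pid=0):
    """Restrict a process (0 means the caller) to a single CPU.

    Processes forked afterwards inherit the mask, which is what keeps
    a whole job on its assigned core.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

Calling pin_job_to_cpu(3) before starting the job's first process would confine the job and its children to CPU 3, whether or not the other CPUs are idle.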

Newer Linux kernels support cgroups, which are structures for managing groups of processes, and provide controllers for managing resources in each cgroup.  In Condor 7.7.0, cgroup support was added for measuring resource usage.  When enabled, Condor will place each job into a dedicated cgroup for the block-I/O, memory, CPU, and "freezer" controllers.  We have implemented two new limiting mechanisms based on the memory and CPU controllers.

The CPU controller provides a mechanism for fairsharing between different cgroups.  CPU shares are assigned to jobs based on the "slot weight" (by default, equal to the number of cores the job requested).  Thus, a job asking for 2 cores will get an average of 2 cores on a fully loaded system.  If there's an idle CPU, it could utilize more than 2 cores; however, it will never get less than what it requested for a significant amount of time.  CPU fairsharing provides a much finer granularity than pinning, easily allowing the jobs-to-cores ratio to be non-integer.
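The arithmetic behind proportional shares is simple enough to sketch. The function below assumes every job is CPU-bound and that the kernel divides the cores strictly in proportion to the weights; expected_cores is a made-up name, and real scheduling will only approximate these numbers.

```python
def expected_cores(requested_cores, total_cores):
    """Cores each job receives on a fully loaded host when CPU shares
    are proportional to the cores requested (idealized model)."""
    total_weight = sum(requested_cores.values())
    return {
        job: total_cores * weight / total_weight
        for job, weight in requested_cores.items()
    }

# 25 single-core jobs on a 24-core machine: impossible to pin evenly,
# but fairsharing gives each job 24/25 of a core.
shares = expected_cores({f"job{i}": 1 for i in range(25)}, 24)
```

With three jobs asking for 2, 1, and 1 cores on a 4-core host, the same function gives the first job exactly the 2 cores it requested.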

The memory controller provides two kinds of limits: soft and hard.  When soft limits are in place, the job can use an arbitrary amount of RAM until the host runs out of memory (and starts to swap); when this happens, only jobs over their limit are swapped out.  With hard limits, the job immediately starts swapping once it hits its RAM limit, regardless of the amount of free memory.  Both soft and hard limits default to the amount of memory requested for the job.
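At the kernel level, both limits are plain files in the job's cgroup directory. The sketch below uses the cgroup v1 memory controller file names (memory.soft_limit_in_bytes and memory.limit_in_bytes); the helper name and the dry run against a temporary directory are assumptions for the example, and this is not a description of how Condor writes them.

```python
import os
import tempfile

SOFT_LIMIT = "memory.soft_limit_in_bytes"  # cgroup v1 memory controller
HARD_LIMIT = "memory.limit_in_bytes"

def set_memory_limit(cgroup_dir, requested_bytes, hard=False):
    """Write the job's memory request as a soft or hard limit."""
    path = os.path.join(cgroup_dir, HARD_LIMIT if hard else SOFT_LIMIT)
    with open(path, "w") as knob:
        knob.write(str(requested_bytes))
    return path

# Dry run against a scratch directory.  A real job cgroup lives under the
# memory controller's mount point and needs root to modify.
demo_dir = tempfile.mkdtemp()
written = set_memory_limit(demo_dir, 2 * 1024 ** 3, hard=True)
```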

Both methods also have disadvantages.  Soft limits can cause "well-behaved" processes to wait while the OS frees up RAM from "badly behaving" processes.  Hard limits can cause large amounts of swapping (for example, if there's a memory leak), decreasing the entire node's disk performance and thus adversely affecting other jobs.  In fact, it may be a better use of resources to preempt a heavily-swapping process and reschedule it on another node than to let it continue running.  There is further room for improvement here in the future.

Regardless, cgroups and controllers provide a solid improvement in resource isolation for Condor, and finish up our series on job isolation.  Thanks for reading!

Monday, February 27, 2012

Improving File Isolation with chroot

In the last post, we examined a new Condor feature called MOUNT_UNDER_SCRATCH that will isolate jobs from each other on the file system by making world-writable directories (such as /tmp and /var/tmp) be unique and isolated per-batch-job.

That work started with the assumption that jobs from the same Unix user don't need to be isolated from each other.  This isn't necessarily true on the grid: a single, shared account per-VO is still popular on the OSG.  For such VOs, an attacker can gain additional credentials by reading the sandbox of each job running under the same Unix username.

To combat proxy-stealing, we use an old Linux trick called a "chroot".  A sysadmin can create a complete copy of the OS inside a directory, and an appropriately-privileged process can change the root of its filesystem ("/") to that directory.  In fact, the phrase "changing root" is where we get the "chroot" terminology.

For example, suppose the root of the system looks like this:

[root@localhost ~]# ls /
bin     cvmfs         hadoop-data2  home        media  opt   selinux  usr
boot    dev           hadoop-data3  lib         misc   proc  srv      var
cgroup  etc           hadoop-data4  lib64       mnt    root  sys
chroot  hadoop-data1  hadoop.log    lost+found  net    sbin  tmp

The sysadmin can create a copy of the RHEL5 operating system inside a sub-directory; at our site, this is /chroot/sl5-v3/root:

[root@localhost ~]# ls /chroot/sl5-v3/root/
bin   cvmfs  etc   lib    media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt    proc  sbin  srv      tmp  var

Note how the contents of the chroot directory are stripped down relative to the OS - we can remove dangerous binaries, sensitive configurations, or anything else unnecessary to running a job.  For example, many common Linux privilege escalation exploits come from the presence of a setuid binary.  Such binaries (at, cron, ping) are necessary for managing the host, but not necessary for a running job.  By eliminating the setuid binaries from the chroot, a sysadmin can eliminate a common attack vector for processes running inside.

Once the directory is built, we can call chroot and isolate ourselves from the host:

[root@red-d15n6 ~]# chroot /chroot/sl5-v3/root/
bash-3.2# ls /
bin   cvmfs  etc   lib   media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt  proc  sbin  srv      tmp  var

Condor, as of 7.7.5, now knows how to invoke the chroot syscall for user jobs.  However, as the job sandbox is written outside the chroot, we must somehow transport it inside before starting the job.  Bind mounts - discussed last time - come to our rescue.  The entire process goes something like this:
  1. Condor, as root, forks off a new child process.
  2. The child uses the unshare system call to place itself in a new filesystem namespace.
  3. The child calls mount to bind-mount the job sandbox inside the chroot.  Any other bind mounts - such as /tmp or /var/tmp - are done at this time.
  4. The child will invoke the chroot system call specifying the directory the sysadmin has configured.
  5. The child drops privileges to the target batch system user, then calls exec to start the user process.
With this patch applied, Condor will copy only the job's sandbox forward into the filesystem namespace, meaning the job has access to no other sandbox (as all other sandboxes live outside the private namespace).  This successfully isolates jobs from each other's sandboxes, even if they run under the same Unix user!
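The five steps above can be sketched in Python with ctypes. This is an illustration under several assumptions, not the Condor patch: the sandbox is assumed to appear at the same absolute path inside the chroot, the function names are invented, and the extra "make mounts private" call is my addition so the bind mount cannot propagate back to the host on systems with shared mounts. It must run as root.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000   # new mount ("filesystem") namespace
MS_BIND = 4096
MS_REC = 16384
MS_PRIVATE = 1 << 18

def sandbox_target(chroot_dir, sandbox):
    """Where the sandbox shows up inside the chroot (assumed: same path)."""
    return os.path.join(chroot_dir, sandbox.lstrip("/"))

def _check(result):
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

def start_job(chroot_dir, sandbox, uid, gid, argv):
    """Run argv inside chroot_dir as uid/gid; returns the child's PID."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_char_p, ctypes.c_ulong, ctypes.c_void_p]
    pid = os.fork()                                    # step 1
    if pid != 0:
        return pid
    try:
        _check(libc.unshare(CLONE_NEWNS))              # step 2
        # Not in the list above: stop mount events reaching the host.
        _check(libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None))
        target = sandbox_target(chroot_dir, sandbox)
        os.makedirs(target, exist_ok=True)
        _check(libc.mount(sandbox.encode(), target.encode(),
                          None, MS_BIND, None))        # step 3
        os.chroot(chroot_dir)                          # step 4
        os.chdir(sandbox)
        os.setgroups([])                               # step 5
        os.setgid(gid)
        os.setuid(uid)
        os.execv(argv[0], argv)
    except Exception as exc:
        os.write(2, f"job setup failed: {exc}\n".encode())
    finally:
        os._exit(127)
```

Note the ordering: the bind mount has to happen before chroot (the sandbox is unreachable afterwards), and privileges are dropped last because mount and chroot both need root.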

The Condor feature is referred to as NAMED_CHROOT, as sysadmins can create multiple chroot-capable directories, give them a user-friendly name (such as RHEL5, as opposed to /chroot/sl5-v3/root), and allow user jobs to ask for the directory by the friendly name in their submit file.

In addition to the security benefits, we have found the NAMED_CHROOT feature allows us to run a RHEL5 job on a RHEL6 host without using virtualization; something for the future.

Going back to our original list of directories needing isolation - system temporary directories, job sandbox, shared filesystems, and GRAM directories - we have now isolated everything except the shared filesystems. The option here is simple, if unpleasant: mount the shared file system as read-only.  This is the modus operandi for $OSG_APP at many sites, and an acceptable (but not recommended) way to run $OSG_DATA (as $OSG_DATA is optional anyway).  It restricts the functionality for the user, but brings us a step closer to our goal of job isolation.

After file isolation, we have one thing left: resource isolation.  Again, a topic for the future.

Monday, February 20, 2012

File Isolation using bind mounts and chroots

The last post ended with a new technique for process-level isolation that unlocks our ability to safely use anonymous accounts and group accounts.

However, that's not "safe enough" for us: the jobs can still interact with each other via the file system.  This post examines the directories jobs can write into, and what can be done to remove this access.

On a typical batch system node, a user can write into the following directories:

  • System temporary directories: The Linux Filesystem Hierarchy Standard (FHS) provides at least two sticky, world-writable directories, /tmp and /var/tmp.  These directories are traditionally unmanaged (user processes can write an uncontrolled amount of data here) and a security issue (symlink attacks and information leaks), even when user separation is in place.
  • Job Sandbox: This is a directory created by the batch system as a scratch location for the job.  The contents of the directory will be cleaned out by the batch system after the job ends.  For Condor, any user proxy, executable, or job stage-in files will be copied here prior to the job starting.
  • Shared Filesystems: For a non-grid site, this is typically at least $HOME, and some other site-specific directory.  $HOME is owned by the user running the job.  On the OSG, we also have $OSG_APP for application installation (typically read-only for worker nodes) and, optionally, $OSG_DATA for data staging (writable for worker nodes).  If they exist and are writable, $OSG_APP/DATA are owned by root and marked as sticky.
  • GRAM directories: For non-Condor OSG sites, a few user-writable directories are needed to transfer the executable, proxy, and job stage-in files from the gatekeeper to the worker node.  These default to $HOME, but can be relocated to any shared filesystem directory.  For Condor-based OSG sites, this is a part of the job sandbox.
If user separation is in place and considered sufficient, filesystem isolation is taken care of for shared filesystems, GRAM directories, and the job sandbox.  The systemwide temporary directories can be protected by mixing filesystem namespaces and bind mounts.

A process can be launched in its own filesystem namespace; such a process will have a copy of the system mount table.  Any change made to the process's mount table will not be seen by the outside system, and will be shared with any child processes.

For example, if the user's home directory is not mounted on the host, the batch system could create a process in a new filesystem namespace and mount the home directory in that namespace.  The home directory will be available to the batch job, but to no other process on the filesystem.

When the last process in the filesystem namespace exits, all mounts that are unique to that namespace will be unmounted.  In our example, when the batch job exits, the kernel will unmount the home directory.

A bind mount makes a file or directory visible at another place in the filesystem - I think of it as mirroring the directory elsewhere.  We can take the job sandbox directory, create a sub-directory, and bind-mount the sub-directory over /tmp.  The process is mostly equivalent to the following shell commands (where $_CONDOR_SCRATCH_DIR is the location of the Condor job sandbox) in a filesystem namespace:

mount --bind $_CONDOR_SCRATCH_DIR/tmp /tmp

Afterward, any files a process creates in /tmp will actually be stored in $_CONDOR_SCRATCH_DIR/tmp - and cleaned up accordingly by Condor on job exit.  Any system process not in the job will not be able to see or otherwise interfere with the contents of the job's /tmp unless it can write into $_CONDOR_SCRATCH_DIR.

Condor refers to this feature as MOUNT_UNDER_SCRATCH, and it will be a part of the 7.7.5 release.  The setting will be an admin-specified list of directories on the worker node.  With it, the job will have a private copy of these directories, which will be backed by $_CONDOR_SCRATCH_DIR.  The contents - and size - of these will be managed by Condor, just like anything else in the scratch directory.
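To make the list-of-directories idea concrete, here is a small sketch that turns such a list into bind-mount commands. The /tmp mapping follows the example above; mirroring each directory's full path under the scratch directory (so /var/tmp becomes scratch/var/tmp) is my assumption, as are the function name and the sandbox path used below.

```python
import os

def bind_mount_plan(scratch_dir, mount_under_scratch):
    """Return one 'mount --bind' argv per admin-listed directory."""
    plan = []
    for directory in mount_under_scratch:
        # Assumed layout: mirror the directory's path under the scratch dir.
        backing = os.path.join(scratch_dir, directory.lstrip("/"))
        plan.append(["mount", "--bind", backing, directory])
    return plan

plan = bind_mount_plan("/var/lib/condor/execute/dir_1234", ["/tmp", "/var/tmp"])
```

Each command would have to be run inside the job's private filesystem namespace, after creating the backing directory, so that the host's own /tmp is untouched.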

If user separation is unavailable or not considered sufficient (if there are, for example, group accounts), an additional layer of isolation is needed to protect the job sandbox.  A topic for a future day!

Tuesday, February 14, 2012

Job Isolation in Condor

I'd like to share a few exciting new features under construction for Condor 7.7.6 (or 7.9.0, as it may be).

I've been working hard to improve the job isolation techniques available in Condor.  My dictionary defines the verb "to isolate" as "to be or remain alone or apart from others"; when applied to the Condor context, we'd like to isolate each job from the others.  We'll define process isolation as the inability of a process running in a batch job to interfere with a process not a part of the job.  Interfering with processes on Linux, loosely defined, means the sending of POSIX signals, taking control via the ptrace mechanism, or writing into the other process's memory.

Process isolation is only one aspect of job isolation.  Job isolation also includes the inability to interfere with other jobs' files (file isolation) and not being able to consume others' system resources such as CPU, memory, or disk (resource isolation).

In Condor, process isolation has historically been accomplished via one of two mechanisms:

  • Submitting user.  Jobs from Alice and Bob will be submitted as the unix users alice and bob, respectively.  In this model, the jobs running on the worker node will be run as users alice and bob, respectively.  The processes in the job running under user bob are protected from the processes in the job running as user alice via traditional POSIX security mechanisms.
    • This model makes the assumption that jobs submitted by the same user do not need isolation from each other.  In other words, there shouldn't be any shared user accounts!
    • This model also assumes the submit host and the worker node share a common user namespace.  This can be more difficult to accomplish than it sounds: if the submit host has thousands of unique users, we must make sure each functions on the worker node.  If the submit host is on a remote site with a different user namespace from the worker node, this may not be easily achievable!
  • Per-slot users.  Each "slot" (roughly corresponding to a CPU) in condor is assigned a unique unix user.  The job currently running in that slot is run under the associated username.
    • This solves the "gotchas" noted above with the submitting user isolation model.
    • This is difficult to accomplish in practice if the job wants to utilize a filesystem shared between the submit and worker nodes.  The filesystem security is based on two users having distinct Unix user names; in this model, there's no way to mark your files as only readable by your own jobs.
Notice both techniques rely on user isolation to accomplish process isolation.  Condor has an oft-overlooked third mode:

  • Mapping remote users to nobody.  In this mode, local users (where the site admin can define the meaning of "local") get mapped to the submit host usernames, but non-local users all get mapped to user nobody - the traditional unprivileged user on Linux.
    • Local users can access all their files, but remote users only get access to the batch resources - no shared file systems.
Unfortunately, this is not a very secure mode as, according to the manual, the nobody account "... may also be used by other Condor jobs running on the same machine, if it is a multi-processor machine"; not very handy advice in an age where your cell phone likely is a multi-processor machine!

This third mode is particularly attractive to us - we can avoid filesystem issues for our local users, but no longer have to create the thousands of accounts in our LDAP database for remote users.  However, since jobs from remote users run under the same unix user account, the traditional security mechanism of user separation does not apply - we need a new technique!

Enter PID namespaces, a new separation technique introduced in kernel 2.6.24.  By passing an additional flag when creating a new process, the kernel will assign an additional process ID (PID) to the child process.  The child will believe itself to be PID 1 (that is, when the child calls getpid(), it returns 1), while the processes in the parent's namespace will see a different PID.  The child will be able to spawn additional processes - all will be stuck in the same inner namespace - that similarly have an inner PID different from the outer one.

Processes within the namespace can only see and interfere (send signals, ptrace, etc) with other processes inside the namespace.  By launching the new job in its own PID namespace, Condor can achieve process isolation without user isolation: the job processes are isolated from all other processes on the system.

Perhaps the best way to visualize the impact of PID namespaces in the job is to examine the output of ps:

[bbockelm@localhost condor]$ condor_run ps faux
bbockelm     1  0.0  0.0 114132  1236 ?        SNs  11:42   0:00 /bin/bash /home/bbockelm/.condor_run.3672
bbockelm     2  0.0  0.0 115660  1080 ?        RN   11:42   0:00 ps faux

Only two processes can be seen from within the job - the shell executing the job script and "ps" itself.
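The "child believes it is PID 1" behavior can also be observed directly. The sketch below is not the Condor patch: instead of passing the flag to clone, it calls unshare(CLONE_NEWPID) in a helper process and then forks, which puts the next child into a fresh PID namespace. It needs root (or CAP_SYS_ADMIN) and a kernel that supports unsharing PID namespaces; when that is not available it returns None. The function name is invented for the example.

```python
import ctypes
import os

CLONE_NEWPID = 0x20000000

def pid_seen_inside_new_namespace():
    """PID a process reports for itself in a fresh PID namespace, or None."""
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    helper = os.fork()
    if helper == 0:
        # Helper: unshare here so the caller's own namespace is untouched.
        os.close(read_fd)
        try:
            if libc.unshare(CLONE_NEWPID) == 0:
                job = os.fork()          # first child of the new namespace
                if job == 0:
                    os.write(write_fd, str(os.getpid()).encode())
                    os._exit(0)
                os.waitpid(job, 0)
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        data = pipe.read()
    os.waitpid(helper, 0)
    return int(data) if data else None
```

When the call is permitted the result is 1, even though the helper sees the same process under an ordinary, much larger PID.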

Releasing a PID namespaces-enabled Condor is an ongoing effort; I've recently re-designed the patch to be far less intrusive on the Condor internals by switching from the glibc clone() call to the clone syscall.  I am hopeful it will make it in the 7.7.6 / 7.9.0 timescale.

From a process isolation point-of-view, with this patch, it now is safe to run jobs as user "nobody" or re-introduce the idea of shared "group accounts".  For example, we could map all CMS users to a single "cmsuser" account without having to worry about these becoming a vector for virus infection.

However, the story of job isolation does not end with PID namespaces.  Stay tuned to find out how we are tackling file and resource isolation!

Friday, January 27, 2012

openstack - update

Last time I was able to deploy an image. The next step would be to list it and then run it, but I have hit problems.

To list images I run command:


which hangs up forever and after long time exits with message "connection reset by peer".

I have disabled iptables to eliminate firewall issues. No help.

All manuals assume that euca-describe-images should simply run and give no instructions on what to do if it does not.

Following Josh's advice I did:

strace -o edi_output -f -ff euca-describe-images

and then I looked into the output files. It seems that there might be two problems:

  1. Some euca2ools files are missing - in particular the .eucarc configuration file.
  2. There are messages about missing python files, for example "open("/usr/lib64/python2.6/site-packages/gtk-2.0/", O_RDONLY) = -1 ENOENT (No such file or directory)" (there are many more like that).
So it seems that the euca2ools installation described in previous posts may not be complete and may be missing some key files. Or python (which we already know had to be patched) is not OK. Or both.
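Reading strace output by eye gets tedious, so a small filter helps. The sketch below pulls out every call that failed with ENOENT; the regular expression is written against lines shaped like the one quoted above and is an assumption about the output format, and missing_paths is just a name chosen for the example.

```python
import re

# Optional leading PID (present with -f but without -ff), then
# syscall("path", ...) = -1 ENOENT
ENOENT_LINE = re.compile(r'^(?:\d+\s+)?(\w+)\("([^"]+)".*=\s*-1\s+ENOENT')

def missing_paths(lines):
    """Return (syscall, path) pairs for calls that failed with ENOENT."""
    found = []
    for line in lines:
        match = ENOENT_LINE.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found

sample = [
    'open("/usr/lib64/python2.6/site-packages/gtk-2.0/", O_RDONLY) = -1 ENOENT (No such file or directory)',
    'open("/etc/passwd", O_RDONLY) = 3',
]
hits = missing_paths(sample)
```

With the -ff option used above, strace writes one file per process named edi_output.PID, so the function can be fed the lines of each of those files in turn. Many ENOENT results are harmless (python probes several directories for each module), so the interesting entries are the paths that are never found anywhere.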

That's all I know for now.

Tuesday, January 10, 2012

How to register an image in openstack

After having installed and configured the worker and controller nodes of the openstack testbed we would like to upload images into it.

First I downloaded some images to /root/images on the controller node. One is from Xin and another one is a minimal image for testing that I got from the net. I have no idea what they are worth.

Then I tried to follow the instructions

which go like this:

uec-publish-tarball $image [bucket-name] [hardware-arch]

and I could not find where the

command comes from. Finally I realized that it comes from Ubuntu and that the manual had become Ubuntu-specific without saying so explicitly.

So I tried a different approach.

cd /root/images

glance add name="My Image" < sl61-kvm.tar.bz2 # the image I got from Xin

The command responded that the image got Id=1, which is a good sign.

Then I did:

glance show 1

and got:

Id: 1
Public: No
Name: My Image
Size: 199737477
Location: file:///var/lib/glance/images/1
Disk format: raw
Container format: ovf

Which suggests that the file is in the system. But when I tried:

glance index

it said:

no public images found

So I tried to register it again:

glance add name="My Image" is_public=true < sl61-kvm.tar.bz2
Added new image with ID: 2

I tried to list:

glance index
Found 1 public images...
ID Name Disk Format Container Format Size
---------------- ------------------------------ -------------------- -------------------- --------------
2 My Image raw ovf 199737477

So it seems we have uploaded an image to the system.

Now I have to figure out how to run it.

Friday, January 6, 2012

How to configure worker node - part 2

Compute node configuration - continued

We execute the following commands:

This command is supposed to synchronize the database:

/usr/bin/nova-manage db sync
Now we have to create users and projects. We call both users and projects "nova"

/usr/bin/nova-manage user admin nova
/usr/bin/nova-manage project create nova nova
/usr/bin/nova-manage network create 1 256

We check that users and projects were created correctly:

/usr/bin/nova-manage project list

/usr/bin/nova-manage user list

Create Certifications

On the controller node execute

mkdir -p /root/creds

/usr/bin/python /usr/bin/nova-manage project zipfile nova nova /root/creds/

If you encounter a python error, then apply the python patch described a few posts earlier.

Create /root/creds on the compute node and copy the file there. Then unpack it

unzip /root/creds/ -d /root/creds/

A few files will appear, among them
/root/creds/novarc . This file needs to be appended to .bashrc, but there is a catch:
the first line of the file has an error and has to be replaced:

Original line:

NOVA_KEY_DIR=$(pushd $(dirname $BASH_SOURCE)>/dev/null; pwd; popd>/dev/null)

has to be replaced with


The content of novarc file now is


export EC2_URL=""
export S3_URL=""
export EC2_USER_ID=42 # nova does not use user id, but bundling requires it
export EC2_PRIVATE_KEY=${NOVA_KEY_DIR}/pk.pem
export EC2_CERT=${NOVA_KEY_DIR}/cert.pem
export NOVA_CERT=${NOVA_KEY_DIR}/cacert.pem
export EUCALYPTUS_CERT=${NOVA_CERT} # euca-bundle-image seems to require this set
alias ec2-bundle-image="ec2-bundle-image --cert ${EC2_CERT} --privatekey ${EC2_PRIVATE_KEY} --user 42 --ec2cert ${NOVA_CERT}"
alias ec2-upload-bundle="ec2-upload-bundle -a ${EC2_ACCESS_KEY} -s ${EC2_SECRET_KEY} --url ${S3_URL} --ec2cert ${NOVA_CERT}"
export NOVA_USERNAME="nova"
export NOVA_URL=""

Where "XXXX.." strings denote keys which I do not post here, for security.

The content of novarc file should now be added to bashrc:

cat /root/creds/novarc >> ~/.bashrc
source ~/.bashrc

This should be done both on compute and controller nodes.

Enable access to worker node

First unset a proxy and then do:

euca-authorize -P icmp -t -1:-1 default
euca-authorize -P tcp -p 22 default

Thursday, January 5, 2012

How to configure worker node

In the following I will describe how to configure the worker node. I assume that the worker node has been already installed following the instructions posted on this blog.

First of all, before we start, we still need to add nova-network (it has not been installed so far).


yum install openstack-nova-network

Once this is done, we can go on and edit the /etc/nova/nova.conf file.

First, add to the file the option


The relevant switches are:


In the end the configuration file should look like:

--logging_context_format_string=%(asctime)s %(name)s: %(levelname)s [%(request_id)s %(user)s %(project)s] %(message)s
--logging_default_format_string=%(asctime)s %(name)s: %(message)s

where {USER}, {PWD} and {DATABASE} denote the nova database user, password and database name.

Now go to the controller node and open the following ports for incoming connections: 3333,3306,5672,8773,8000.

Go back to worker node and prepare /root/bin/ script with the following content:

for n in ajax-console-proxy compute vncproxy network; do
    service openstack-nova-$n "$@"
done

Then run

/root/bin/ stop
Stopping OpenStack Nova Web-based serial console proxy: [ OK ]
Stopping OpenStack Nova Compute Worker: [ OK ]
Stopping OpenStack Nova VNC Proxy: [ OK ]
Stopping OpenStack Nova Network Controller: [ OK ]
[root@gridreserve30 compute]# /root/bin/ start
Starting OpenStack Nova Web-based serial console proxy: [ OK ]
Starting OpenStack Nova Compute Worker: [ OK ]
Starting OpenStack Nova VNC Proxy: [ OK ]
Starting OpenStack Nova Network Controller: [ OK ]

to be continued...