Wednesday, March 19, 2014

Submitting jobs to HTCondor using Python

I've had several requests for a tutorial on using the HTCondor python bindings; current documentation resources for these include:

My presentation at the 2013 HTCondor week.
The HTCondor manual page.
Python's built-in help() facility.
The HTCondor users mail list.

However, more examples are always useful! This blog entry will attempt to cover the most common use cases - ClassAds, querying HTCondor, and submitting jobs.

Why Python Bindings?

Before we launch into the how, let's examine the why. The python bindings provide a developer-friendly mechanism for interacting with HTCondor. A few highlights:

They call the HTCondor libraries directly, avoiding a fork/exec of a subprocess.
They provide a "pythonic" interaction with HTCondor; the design is meant to be familiar to a python programmer. Errors raise python exceptions.
They have thorough integration with ClassAds. Because they use the HTCondor implementation of ClassAds, the result is a very complete implementation of the ClassAd language. ClassAd expressions can be created cleanly without worrying about string quoting issues.
Most actions that can be performed through the HTCondor command-line tools are exposed via python.

The bindings themselves are compiled against the system version of python and a specific version of HTCondor. This limits the portability (you cannot reliably email compiled binaries to others), meaning they are most effective when they are installed onto the system by the sysadmin; that said, they are shipped with all HTCondor versions supported by UW except for Windows.

Loading the modules

The bindings are split into two python modules, htcondor and classad. To verify your environment is set up correctly, do the following in python:

$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import classad
>>> import htcondor
>>>

If no exception is thrown, you are ready to proceed to the next section! If an exception is thrown, check your HTCondor installation and the value of the PYTHONPATH environment variable if you are using a non-root install.
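
If you would rather check from a script than from the interactive prompt, something like the following sketch may help. It uses only the standard library (and Python 3 syntax, unlike the transcripts in this post); the helper name missing_modules is mine, not part of the bindings.

```python
import importlib.util


def missing_modules(names=("classad", "htcondor")):
    """Return the module names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not importable (check the install and PYTHONPATH):", ", ".join(missing))
    else:
        print("Both binding modules were found.")
```

Note that this only confirms the modules can be located; importing them is still the real test.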

Begin with the Basics: ClassAds

ClassAds are the lingua franca of HTCondor, and hence the basic essential data structure of the python bindings. Each ad is formed as a set of key-value pairs, where the value is a ClassAd expression (such as 2 + 2). This differs from a JSON map, where the value must be a literal (4). When evaluating expressions, one can reference other attributes in the ClassAd.

Consider the following ClassAd interaction:

>>> ad = ClassAd()
>>> ad['foo'] = 1
>>> ad['bar'] = 2
>>> ad['baz'] = ExprTree("foo + bar")
>>> ad
[ baz = foo + bar; bar = 2; foo = 1 ]
>>> ad['baz']
foo + bar
>>> ad['baz'].eval()
3L
>>>

We first create an empty ClassAd, then do some value assignments in a manner similar to a Python dictionary. For the baz attribute, we create a new ExprTree (a ClassAd expression) object. The string given to the ExprTree constructor is parsed as a new ClassAd expression.

Note that if we reference baz, the expression itself is returned; if we instead referenced foo, the python object 1 would be returned. The classad library will coerce references to python objects if possible; if not possible, it will return ExprTrees. To force the return of an ExprTree, use the lookup method of the ClassAd; to force the return of a python object, use the eval method.

In 8.1.3, HTCondor introduced more convenient ways to build expressions. We could replace ExprTree("foo + bar") above with:

Attribute("foo") + Attribute("bar")

We believe that explicitly forming expressions in this manner is less likely to result in quoting issues (analogous to how one avoids SQL injection attacks).

ClassAd expressions include the most common programming operators, lists, sub-ClassAds, attribute references, function calls, strings, numbers, and booleans. See the full language description for a thorough treatment.
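
To make the literal-versus-expression distinction concrete without needing the bindings installed, here is a toy model in plain Python. It is not the classad library: ToyAd, Ref, and Add are hypothetical names invented for this sketch, and it supports only addition. It stores an expression as a small object and evaluates it against the ad on demand.

```python
class Ref:
    """A reference to another attribute in the same ad."""

    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        return Add(self, other)

    def __repr__(self):
        return self.name

    def eval(self, ad):
        return ad.eval(self.name)


class Add:
    """An unevaluated sum of two operands (references or literals)."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __repr__(self):
        return "%r + %r" % (self.left, self.right)

    def eval(self, ad):
        return _evaluate(self.left, ad) + _evaluate(self.right, ad)


def _evaluate(node, ad):
    # Literals evaluate to themselves; expression objects are evaluated in the ad.
    return node.eval(ad) if hasattr(node, "eval") else node


class ToyAd(dict):
    """A dictionary whose values may be literals or expression objects."""

    def eval(self, name):
        return _evaluate(self[name], self)


ad = ToyAd(foo=1, bar=2)
ad["baz"] = Ref("foo") + Ref("bar")
print(ad["baz"])        # the stored expression: foo + bar
print(ad.eval("baz"))   # 3
ad["foo"] = 10
print(ad.eval("baz"))   # 12, because the expression is re-evaluated against the ad
```

Building the expression from objects, rather than from a string, is the same idea as the Attribute("foo") + Attribute("bar") form above: there is no string to quote incorrectly.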

Querying HTCondor

The two most common daemons to query in HTCondor are the collector (which holds descriptions for all daemons running in the pool) and the schedd (which maintains the job queue).

We'll start with the collector. Begin by creating a Collector object:

>>> coll = htcondor.Collector()

The collector object will default to the collector daemon in the machine's configuration; alternately, the constructor accepts a hostname as a string argument.

Once created, you can use the query method to get ClassAds from the collector:

>>> ads = coll.query(htcondor.AdTypes.Startd)
>>> len(ads)
4128

This returns a Python list of ClassAds. By default, query returns all attributes for all ClassAds of a given type; returning such a large amount of data can take a long time. Further function arguments refine the amount of data returned:

>>> ads = coll.query(htcondor.AdTypes.Startd, 'Machine =?= "red-d9n1.unl.edu"', ["Name", "RemoteOwner"])
>>> len(ads)
15
>>> ads[0]
[ Name = "slot1@red-d9n1.unl.edu"; MyType = "Machine"; TargetType = "Job"; CurrentTime = time() ]

The second argument provides a ClassAd expression which serves as a filter; the third argument is a list of attributes to include. Note that the collector may add some default attributes and may not return a requested attribute if it is not present in the ad.

The creation of a Schedd object can be done in a manner similar to the Collector for a local schedd:

>>> schedd = htcondor.Schedd()

Alternately, you can use the Collector's locate method to find a remote Schedd address:

>>> addr = coll.locate(htcondor.DaemonTypes.Schedd, "schedd.example.com")
>>> schedd = htcondor.Schedd(addr)

Once the schedd object is created, the query method is used to list jobs:

>>> jobs = schedd.query()
>>> len(jobs)
2096

Again, additional arguments allow you to trim the number of ads and the number of attributes returned:

>>> jobs = schedd.query('Owner=?="cmsprod088"', ["ClusterId", "JobStatus"])
>>> len(jobs)
336
>>> jobs[0]
[ MyType = "Job"; JobStatus = 2; TargetType = "Machine"; ServerTime = 1395254896; CurrentTime = time(); ClusterId = 2940860 ]

Starting in 8.1.5, the xquery method has been added. Instead of buffering all ads in memory in the form of a python list, xquery returns an iterator; reading through the iterator will block as ClassAds are returned by the schedd. This reduces total memory usage and allows the user to interleave several queries at once.
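
The difference between the two styles is the familiar one between a Python list and a generator. The sketch below does not talk to a schedd at all; fake_query and interleave are hypothetical helpers that only illustrate buffering everything up front versus consuming several result streams a piece at a time.

```python
def fake_query(owner, count):
    """Hypothetical stand-in for a schedd query: yields one job ad at a time."""
    for cluster_id in range(count):
        yield {"Owner": owner, "ClusterId": cluster_id}


def interleave(*iterators):
    """Round-robin over several iterators, holding one item from each at a time."""
    pending = [iter(it) for it in iterators]
    while pending:
        for it in list(pending):
            try:
                yield next(it)
            except StopIteration:
                pending.remove(it)


# Buffered, in the style of query(): every ad is in memory before we look at the first.
buffered = list(fake_query("alice", 3))

# Streaming, in the style of xquery(): ads from two queries are consumed as they arrive.
for ad in interleave(fake_query("alice", 2), fake_query("bob", 3)):
    print(ad["Owner"], ad["ClusterId"])
```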

Submitting Jobs

Submitting jobs is one of the more confusing aspects of the Python bindings for beginners. This is because job descriptions must be provided as a ClassAd instead of HTCondor submit file format. The submit file format is a macro substitution language evaluated at submit time.

For example, consider the following submit file:

executable = test.sh
arguments = foo bar
log = test.log
output = test.out.$(Process)
error = test.err
transfer_output_files = output
should_transfer_files = yes
queue 1

The equivalent submit ClassAd is:

[
  Cmd = "test.sh";
  Arguments = "foo bar";
  UserLog = "test.log";
  Out = strcat("test.out.", ProcId);
  Err = "test.err";
  TransferOutput = "output";
  ShouldTransferFiles = "YES";
]

A few items of note for converting submit files to ClassAds:

The translation from the submit file commands to ClassAd attributes often results in different attribute names (executable corresponds to Cmd). An extensive, but not exhaustive, list of attributes is available in the HTCondor manual.
Some submit file commands result in multiple attribute changes in the ClassAd. If you are unsure how a submit file command maps to a ClassAd, you can run condor_submit -dump /dev/null test.submit to have HTCondor dump the resulting ClassAd to stdout. This command includes all attributes, including ones that are auto-filled; do not copy the entire ad, but look just for the changes.
Submit file commands do not have a type and the quoting rules differ between commands; you must properly quote strings in the ClassAd using the ClassAd language rules.
Macro substitution is not available in ClassAds. Notice how test.out.$(Process) in the submit file becomes a strcat() expression over ProcId in the ClassAd; the latter is evaluated at runtime.
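
The points above can be sketched as a tiny converter. This is an illustration only: it knows just the seven commands from the example, the function names are mine, and the only macro it handles is $(Process). For anything real, condor_submit -dump remains the authority.

```python
# Mapping taken from the example above; real submit files have many more commands.
SUBMIT_TO_CLASSAD = {
    "executable": "Cmd",
    "arguments": "Arguments",
    "log": "UserLog",
    "output": "Out",
    "error": "Err",
    "transfer_output_files": "TransferOutput",
    "should_transfer_files": "ShouldTransferFiles",
}


def quote(value):
    """Render a Python string as a double-quoted ClassAd string literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def convert_value(command, value):
    """Turn one submit-file value into ClassAd expression text."""
    if command == "should_transfer_files":
        return quote(value.upper())
    if "$(Process)" in value:
        # No macros in ClassAds: splice ProcId in with strcat() instead.
        pieces = []
        for index, part in enumerate(value.split("$(Process)")):
            if index:
                pieces.append("ProcId")
            if part:
                pieces.append(quote(part))
        return "strcat(" + ", ".join(pieces) + ")"
    return quote(value)


def submit_to_classad(text):
    """Convert the known 'command = value' lines of a submit file to ClassAd text."""
    lines = []
    for raw in text.splitlines():
        command, separator, value = raw.partition("=")
        command = command.strip().lower()
        if not separator or command not in SUBMIT_TO_CLASSAD:
            continue  # skips "queue" and any command this sketch does not know
        attribute = SUBMIT_TO_CLASSAD[command]
        lines.append("  %s = %s;" % (attribute, convert_value(command, value.strip())))
    return "[\n" + "\n".join(lines) + "\n]"


if __name__ == "__main__":
    print(submit_to_classad("executable = test.sh\noutput = test.out.$(Process)\nqueue 1"))
```

The resulting text could then be handed to the classad module for parsing before submission.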

Once you have your ClassAd prepared, submitting it is straightforward:

>>> schedd = htcondor.Schedd()
>>> schedd.submit(ad)
23498

The return value is the Cluster ID. To submit multiple jobs in the same job cluster, you can pass a second argument to submit. For example, to submit 5 jobs:

>>> schedd.submit(ad, 5)
23499

Parting Thoughts

In this entry, we covered the basics of using the HTCondor python bindings. We covered only about 10% of the API; left untouched were advanced ClassAd topics, manipulating jobs, remote submission, and managing running daemons.

I hope to have a few more entries to cover other aspects of the API. Have a particular request? Leave a comment!

Monday, January 28, 2013

Introducing the HTCondor-CE

At the heart of the OSG Compute Element (CE) is the gatekeeper software. The gatekeeper software anchors three core pieces of functionality:

Remote access: The gatekeeper provides a network service that remote clients can contact and interact with.
Authentication and authorization: The gatekeeper is responsible for authenticating the client and deciding on what actions it is authorized to perform.
Resource allocation: The gatekeeper accepts an abstract description of a resource to allocate and actualizes the resource request within the local environment.

The existing software, Globus GRAM, provides an HTTP-like interface over TLS for remote access. The authentication is done using the Grid Security Infrastructure (GSI), using special client certificates. It does authorization by performing a callout to map the client certificate to a Unix account, then performing all further operations as that Unix user. The resource allocation provides an interface which accepts requests in Globus RSL (a job description language) and interacts with a local batch system on the CE to run the job.

With the HTCondor team, the OSG has been working to provide an alternate gatekeeper implementation, the HTCondor-CE. The HTCondor-CE is a special configuration of the HTCondor software which provides the three core pieces of functionality described above.

HTCondor provides remote access using a custom communication protocol called CEDAR. CEDAR provides an RPC and messaging mechanism over UDP or TCP, and can provide various levels of integrity or encryption based upon the session parameters. While the HTCondor-CE will ship with the same GSI authentication and authorization as Globus GRAM, it can be reconfigured to provide alternate authentication mechanisms such as Kerberos, SSL, shared secret, or even IP-based authentication.

The HTCondor-CE allocates resources by having the client submit HTCondor jobs to a scheduler running on the CE (the schedd daemon). We refer to this as the "grid job". A separate daemon, the JobRouter, is responsible for transforming the grid job into a resource allocation for the site. For a site with a HTCondor batch system, it will transform and mirror the grid job into the routed job in the site's batch system. The process is illustrated below:

The submit workflow for the HTCondor-CE running on a site with the HTCondor batch system. Notice the JobRouter copies the job directly into the site's batch system.

For sites with the PBS batch system, the routed job stays in the HTCondor-CE schedd (as the JobRouter does not know how to submit directly into the PBS queue), and the job is submitted into PBS using the blahp daemon. See the illustration below:

The HTCondor-CE submit workflow for a PBS site. Notice the blahp, not the JobRouter, does the submission to PBS in this case.

The blahp daemon is a common piece of software for interacting with batch systems - in addition to being integrated in the HTCondor grid universe, it also is used by the BOSCO project and the CREAM CE.

Note there is no requirement that the job be routed into a batch system - given the appropriate transform logic, the JobRouter could also transform the grid job into a VM running in Amazon EC2, an OpenStack instance, or a job for another HTCondor-CE!

The CE is quite flexible; it is a configuration of the HTCondor software and leverages all the features available in HTCondor. As another example, we benefit from the fact that HTCondor's security uses sessions; clients do not re-authenticate for each status update. Future features, such as the sandbox size limits in the upcoming 7.9.4, can be used immediately by the CE through a configuration file change.

The HTCondor-CE is currently under development, although functionality has been demonstrated using glideinWMS for up to 5,000 running pilots. It requires HTCondor 7.9.2 or later, so we are waiting for the next stable release (due late April) before starting to release the CE more widely. As we near release, I am planning on doing additional updates on specific pieces of this technology.

We're looking forward to seeing how users will put it into action!

Saturday, January 5, 2013

Fun with ClassAds

One of the new technologies the OSG Technology area is working on is the HTCondor-CE. While that is a topic for a different post, it led me on a surprising journey over my Christmas break.

Working with the HTCondor-CE, I found creating a job hook to be surprisingly difficult. A job hook for HTCondor is an external script, invoked by HTCondor in lieu of running internal logic. This allows a sysadmin to add custom logic to HTCondor internals without resorting to writing C++ code. The hook in question is the job transformation step for the JobRouter.

The problem with hooks is they are surprisingly difficult to write. For the transform hook, a job's ClassAd is written to the script's stdin and the JobRouter expects to read the transformed ClassAd from stdout. [Actually, it's a touch more complicated than that, but this simplification will do for our discussion.] ClassAds are an expressive and powerful language - but a language difficult to parse via Unix scripting! There are complex quoting and attribute evaluation rules.

Sysadmins are left with a decision - either spend quite some time implementing a ClassAd parser or only do the bare minimum and hope no one submits a complex ClassAd. I found the situation unsatisfactory and decided to write python bindings for the ClassAd library.

I found the endeavor fairly straightforward using the Boost.Python library, and ended up with a new GitHub project. Now, a job transform hook is as simple as this:

#!/usr/bin/python

import sys
import classad

route_ad = classad.ClassAd(sys.stdin.readline())
separator_line = sys.stdin.readline()
assert separator_line == "------\n"
ad = classad.parseOld(sys.stdin)

ad["Universe"] = 5
ad["GridResource"] = "condor localhost localhost"
if "x509UserProxyFirstFQAN" in ad and "/cms" in ad.eval("x509UserProxyFirstFQAN"):
    ad["AccountingGroup"] = "cms.%s" % ad.eval("Owner")
else:
    ad["AccountingGroup"] = "other.%s" % ad.eval("Owner")

print ad.printOld(),

The above script will read the ad from stdin and change the AccountingGroup attribute based on the contents of the x509UserProxyFirstFQAN attribute.

Note ClassAds can be constructed from a string or a file object. Each ad can be treated like a python dictionary. Literals are converted to the equivalent python objects; expressions are exposed as objects. For example:

>>> import classad
>>> ad = classad.ClassAd()
>>> expr = classad.ExprTree("2+2")
>>> ad["foo"] = expr
>>> print ad["foo"]
2 + 2
>>> print ad["foo"].eval()
4

Most of the functionality is exposed; see the GitHub project for examples and unit tests. To make the C++ library safe to export to python, some minor semantics have been changed. Sub-ClassAds and Lists are not yet available via python, but shouldn't be too hard to add.

ClassAd Python bindings - maybe not the most life-changing software project in the world. However, they have potential to become one of life's little pleasures for those of us who deal with HTCondor every day!

Saturday, March 10, 2012

Resource Isolation in Condor using cgroups

This is the last in my series on job isolation techniques. The series has spanned several postings over the last month, so it may help to recap:

Part I covered process isolation, preventing processes in one job from interacting with other jobs. This has been achievable through POSIX mechanisms for a while, but the new PID namespaces mechanisms provide improved isolation for jobs running as the same user.
Part II and Part III discussed file isolation using bind mounts and chroots. Condor uses bind mounts to remove access to "problematic" directories such as /tmp. While more complex to set up, chroots allow jobs to run in a completely separate environment from the host and further isolate the job sandbox.

This post will cover resource isolation: preventing jobs from consuming system resources promised to another job.

Condor has always had some crude form of resource isolation. For example, the worker node could be configured to detect when the processes in a job have used more CPU time than walltime (a rough indication that more than one core is being used) or when the sum of each process's virtual memory size exceeds the memory requested for the job. When Condor detects too many resources are being consumed, it can take an action such as suspending or killing the job.

This traditional approach is relatively unsatisfactory for a few reasons:

Condor periodically polls to view resource consumption. Any activity between polls is unmonitored.
The metrics Condor traditionally monitors are limited to memory and CPU, where the memory metrics are poor quality for complex jobs. The sum of many processes' virtual memory sizes, on a modern Linux box, has little correlation with RAM used and is not particularly meaningful.
We can do little with the system besides detect when resource limits have been violated and kill the job.

We cannot, for example, simply instruct the kernel to reduce the job's memory or CPU usage.
Accordingly, users must ask for peak resource usage, which may be well-above average resource usage, decreasing overall throughput. If the job needs 2GB on average but 4GB for a single second, the user will ask for 4GB; the other 2GB will be un-utilized.

In Linux, the oldest form of resource isolation is processor affinity or CPU pinning: a job can be locked to a specific CPU, and all its processes will inherit the affinity. Because two jobs are locked to separate CPUs, they will never consume each other's CPU resources. CPU pinning is unsatisfactory for reasons similar to memory: jobs can't utilize otherwise-idle CPUs, decreasing potential system throughput. The granularity is also poor: you can't evenly fairshare 25 jobs on a machine with 24 cores as each job must be locked to at least one core. However, it's a step forward - you don't need to kill jobs for using too much CPU - and has been present in Condor since 7.3.
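
To see affinity inheritance in action outside of Condor, here is a small Linux-only sketch using Python's os.sched_setaffinity. The helper name run_pinned is mine; this is not how Condor itself applies affinity, just a demonstration that a pinned process keeps its CPU set across exec.

```python
import os
import subprocess
import sys


def run_pinned(cpu, argv):
    """Run argv in a child process locked to a single CPU and return its stdout."""

    def pin():
        # Runs in the child between fork and exec; the affinity survives the exec.
        os.sched_setaffinity(0, {cpu})

    result = subprocess.run(argv, preexec_fn=pin, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    first_cpu = min(os.sched_getaffinity(0))
    report = [sys.executable, "-c", "import os; print(sorted(os.sched_getaffinity(0)))"]
    print(run_pinned(first_cpu, report))
```

The child reports a one-element CPU list even though the parent is free to run anywhere it was allowed before.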

Newer Linux kernels support cgroups, which are structures for managing groups of processes, and provide controllers for managing resources in each cgroup. In Condor 7.7.0, cgroup support was added for measuring resource usage. When enabled, Condor will place each job into a dedicated cgroup for the block-I/O, memory, CPU, and "freezer" controllers. We have implemented two new limiting mechanisms based on the memory and CPU controllers.

The CPU controller provides a mechanism for fairsharing between different cgroups. CPU shares are assigned to jobs based on the "slot weight" (by default, equal to the number of cores the job requested). Thus, a job asking for 2 cores will get an average of 2 cores on a fully loaded system. If there's an idle CPU, it could utilize more than 2 cores; however, it will never get less than what it requested for a significant amount of time. CPU fairsharing provides a much finer granularity than pinning, easily allowing the jobs-to-cores ratio to be non-integer.
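
The arithmetic behind proportional sharing is simple enough to sketch. The function below is a back-of-the-envelope model, not the kernel's scheduler: it assumes every job is always runnable and just divides the cores in proportion to slot weight. The name expected_cores is mine.

```python
from fractions import Fraction


def expected_cores(slot_weights, total_cores):
    """Average cores per job on a fully loaded host, proportional to slot weight."""
    total_weight = sum(slot_weights.values())
    return {
        job: Fraction(weight * total_cores, total_weight)
        for job, weight in slot_weights.items()
    }


# The case pinning cannot handle: 25 single-core jobs on a 24-core machine.
shares = expected_cores({"job%d" % i: 1 for i in range(25)}, 24)
print(shares["job0"])   # 24/25 of a core each
```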

The memory controller provides two kinds of limits: soft and hard. When soft limits are in place, the job can use an arbitrary amount of RAM until the host runs out of memory (and starts to swap); when this happens, only jobs over their limit are swapped out. With hard limits, the job immediately starts swapping once it hits its RAM limit, regardless of the amount of free memory. Both soft and hard limits default to the amount of memory requested for the job.
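
As a rough mental model of the two policies (and only that; the kernel's real behavior has many more moving parts), the following sketch answers one question: which jobs become candidates for swapping? The function name and the megabyte figures are made up for illustration.

```python
def jobs_pushed_to_swap(demand_mb, limit_mb, host_ram_mb, mode):
    """Return the set of jobs subject to swapping under a 'soft' or 'hard' policy."""
    over_limit = {job for job, demand in demand_mb.items() if demand > limit_mb[job]}
    if mode == "hard":
        # Hard limit: exceeding your own limit is enough, free RAM or not.
        return over_limit
    if mode == "soft":
        # Soft limit: nothing happens until the host as a whole runs out of RAM.
        host_is_full = sum(demand_mb.values()) > host_ram_mb
        return over_limit if host_is_full else set()
    raise ValueError("mode must be 'soft' or 'hard'")


demand = {"a": 4096, "b": 1024}
limit = {"a": 2048, "b": 2048}
print(jobs_pushed_to_swap(demand, limit, host_ram_mb=8192, mode="soft"))   # set()
print(jobs_pushed_to_swap(demand, limit, host_ram_mb=8192, mode="hard"))   # {'a'}
```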

Both methods also have disadvantages. Soft limits can cause "well-behaved" processes to wait while the OS frees up RAM from "badly behaving" processes. Hard limits can cause large amounts of swapping (for example, if there's a memory leak), decreasing the entire node's disk performance and thus adversely affecting other jobs. In fact, it may be a better use of resources to preempt a heavily-swapping process and reschedule it on another node than let it continue running. There is further room for improvement here in the future.

Regardless, cgroups and controllers provide a solid improvement in resource isolation for Condor, and finish up our series on job isolation. Thanks for reading!

Monday, February 27, 2012

Improving File Isolation with chroot

In the last post, we examined a new Condor feature called MOUNT_UNDER_SCRATCH that will isolate jobs from each other on the file system by making world-writable directories (such as /tmp and /var/tmp) be unique and isolated per-batch-job.

That work started with the assumption that jobs from the same Unix user don't need to be isolated from each other. This isn't necessarily true on the grid: a single, shared account per-VO is still popular on the OSG. For such VOs, an attacker can gain additional credentials by reading the sandbox of each job running under the same Unix username.

To combat proxy-stealing, we use an old Linux trick called a "chroot". A sysadmin can create a complete copy of the OS inside a directory, and an appropriately-privileged process can change the root of its filesystem ("/") to that directory. In fact, the phrase "changing root" is where we get the "chroot" terminology.

For example, suppose the root of the system looks like this:

[root@localhost ~]# ls /
bin     cvmfs         hadoop-data2  home        media  opt   selinux  usr
boot    dev           hadoop-data3  lib         misc   proc  srv      var
cgroup  etc           hadoop-data4  lib64       mnt    root  sys
chroot  hadoop-data1  hadoop.log    lost+found  net    sbin  tmp

The sysadmin can create a copy of the RHEL5 operating system inside a sub-directory; at our site, this is /chroot/sl5-v3/root:

[root@localhost ~]# ls /chroot/sl5-v3/root/
bin   cvmfs  etc   lib    media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt    proc  sbin  srv      tmp  var

Note how the contents of the chroot directory are stripped down relative to the OS - we can remove dangerous binaries, sensitive configurations, or anything else unnecessary to running a job. For example, many common Linux privilege escalation exploits come from the presence of a setuid binary. Such binaries (at, cron, ping) are necessary for managing the host, but not necessary for a running job. By eliminating the setuid binaries from the chroot, a sysadmin can eliminate a common attack vector for processes running inside.

Once the directory is built, we can call chroot and isolate ourselves from the host:

[root@red-d15n6 ~]# chroot /chroot/sl5-v3/root/
bash-3.2# ls /
bin   cvmfs  etc   lib   media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt  proc  sbin  srv      tmp  var

Condor, as of 7.7.5, now knows how to invoke the chroot syscall for user jobs. However, as the job sandbox is written outside the chroot, we must somehow transport it inside before starting the job. Bind mounts - discussed last time - come to our rescue. The entire process goes something like this:

Condor, as root, forks off a new child process.
The child uses the unshare system call to place itself in a new filesystem namespace.
The child calls mount to bind-mount the job sandbox inside the chroot. Any other bind mounts - such as /tmp or /var/tmp - are done at this time.
The child will invoke the chroot system call specifying the directory the sysadmin has configured.
The child drops privileges to the target batch system user, then calls exec to start the user process.
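
The steps above can be sketched in Python, calling unshare and mount through ctypes. To be clear, this is not Condor's implementation (which is C++); it is an untested-as-root illustration with names of my own choosing, it must run as root, and it adds one step not in the list: marking mounts private so the bind mount does not propagate back to the host on systems where mounts are shared by default. Extra bind mounts such as /tmp are left out.

```python
import ctypes
import os
import sys

CLONE_NEWNS = 0x00020000   # new mount ("filesystem") namespace
MS_BIND = 4096
MS_REC = 16384
MS_PRIVATE = 1 << 18


def _check(result, what):
    """Raise OSError with the C errno if a libc call reported failure."""
    if result != 0:
        code = ctypes.get_errno()
        raise OSError(code, "%s failed: %s" % (what, os.strerror(code)))


def run_in_chroot(chroot_dir, sandbox, uid, gid, argv):
    """Run argv inside chroot_dir with only its own sandbox visible; return wait status."""
    pid = os.fork()                                    # 1. fork a child
    if pid:
        return os.waitpid(pid, 0)[1]
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                               ctypes.c_ulong, ctypes.c_void_p]
        _check(libc.unshare(CLONE_NEWNS), "unshare")   # 2. private mount namespace
        # Assumption beyond the listed steps: keep our mounts from propagating out.
        _check(libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None), "make private")
        target = os.path.join(chroot_dir, sandbox.lstrip("/"))
        os.makedirs(target, exist_ok=True)
        _check(libc.mount(sandbox.encode(), target.encode(), None, MS_BIND, None),
               "bind mount")                           # 3. sandbox appears in the chroot
        os.chroot(chroot_dir)                          # 4. change root
        os.chdir("/")
        os.setgroups([])                               # 5. drop privileges, then exec
        os.setgid(gid)
        os.setuid(uid)
        os.execv(argv[0], argv)
    except Exception as exc:
        sys.stderr.write("child setup failed: %s\n" % exc)
    finally:
        os._exit(127)   # only reached if exec did not happen
```

The ordering matters: the bind mount has to happen before the chroot (afterward the sandbox's original path is unreachable), and privileges are dropped last because every earlier step needs root.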

With this patch applied, Condor will copy only the job's sandbox forward into the filesystem namespace, meaning the job has access to no other sandbox (as all other sandboxes live outside the private namespace). This successfully isolates jobs from each other's sandboxes, even if they run under the same Unix user!

The Condor feature is referred to as NAMED_CHROOT, as sysadmins can create multiple chroot-capable directories, give them a user-friendly name (such as RHEL5, as opposed to /chroot/sl5-v3/root), and allow user jobs to ask for the directory by the friendly name in their submit file.

In addition to the security benefits, we have found the NAMED_CHROOT feature allows us to run a RHEL5 job on a RHEL6 host without using virtualization; something for the future.

Going back to our original list of directories needing isolation - system temporary directories, job sandbox, shared filesystems, and GRAM directories - we have now isolated everything except the shared filesystems. The option here is simple, if unpleasant: mount the shared file system as read-only. This is the modus operandi for $OSG_APP at many sites, and an acceptable (but not recommended) way to run $OSG_DATA (as $OSG_DATA is optional anyway). It restricts the functionality for the user, but brings us a step closer to our goal of job isolation.

After file isolation, we have one thing left: resource isolation. Again, a topic for the future.

Monday, February 20, 2012

File Isolation using bind mounts and chroots

The last post ended with a new technique for process-level isolation that unlocks our ability to safely use anonymous accounts and group accounts.

However, that's not "safe enough" for us: the jobs can still interact with each other via the file system. This post examines the directories jobs can write into, and what can be done to remove this access.

On a typical batch system node, a user can write into the following directories:

System temporary directories: The Linux Filesystem Hierarchy Standard (FHS) provides at least two sticky, world-writable directories, /tmp and /var/tmp. These directories are traditionally unmanaged (user processes can write an uncontrolled amount of data here) and a security issue (symlink attacks and information leaks), even when user separation is in place.
Job Sandbox: This is a directory created by the batch system as a scratch location for the job. The contents of the directory will be cleaned out by the batch system after the job ends. For Condor, any user proxy, executable, or job stage-in files will be copied here prior to the job starting.
Shared Filesystems: For a non-grid site, this is typically at least $HOME, and some other site-specific directory. $HOME is owned by the user running the job. On the OSG, we also have $OSG_APP for application installation (typically read-only for worker nodes) and, optionally, $OSG_DATA for data staging (writable for worker nodes). If they exist and are writable, $OSG_APP/DATA are owned by root and marked as sticky.
GRAM directories: For non-Condor OSG sites, a few user-writable directories are needed to transfer the executable, proxy, and job stage-in files from the gatekeeper to the worker node. These default to $HOME, but can be relocated to any shared filesystem directory. For Condor-based OSG sites, this is a part of the job sandbox.

If user separation is in place and considered sufficient, filesystem isolation is taken care of for shared filesystems, GRAM directories, and the job sandbox. The systemwide temporary directories can be protected by mixing filesystem namespaces and bind mounts.

A process can be launched in its own filesystem namespace; such a process will have a copy of the system mount table. Any change made to the process's mount table will not be seen by the outside system, and will be shared with any child processes.

For example, if the user's home directory is not mounted on the host, the batch system could create a process in a new filesystem namespace and mount the home directory in that namespace. The home directory will be available to the batch job, but to no other process on the host.

When the last process in the filesystem namespace exits, all mounts that are unique to that namespace will be unmounted. In our example, when the batch job exits, the kernel will unmount the home directory.

A bind mount makes a file or directory visible at another place in the filesystem - I think of it as mirroring the directory elsewhere. We can take the job sandbox directory, create a sub-directory, and bind-mount the sub-directory over /tmp. The process is mostly equivalent to the following shell commands (where $_CONDOR_SCRATCH_DIR is the location of the Condor job sandbox) in a filesystem namespace:

mkdir $_CONDOR_SCRATCH_DIR/tmp
mount --bind $_CONDOR_SCRATCH_DIR/tmp /tmp

Afterward, any files a process creates in /tmp will actually be stored in $_CONDOR_SCRATCH_DIR/tmp - and cleaned up accordingly by Condor on job exit. Any system process not in the job will not be able to see or otherwise interfere with the contents of the job's /tmp unless it can write into $_CONDOR_SCRATCH_DIR.

Condor refers to this feature as MOUNT_UNDER_SCRATCH, and it will be a part of the 7.7.5 release. It takes an admin-specified list of directories on the worker node. With it, the job will have a private copy of these directories, which will be backed by $_CONDOR_SCRATCH_DIR. The contents - and size - of these will be managed by Condor, just like anything else in the scratch directory.

If user separation is unavailable or not considered sufficient (if there are, for example, group accounts), an additional layer of isolation is needed to protect the job sandbox. A topic for a future day!

Tuesday, February 14, 2012

Job Isolation in Condor

I'd like to share a few exciting new features under construction for Condor 7.7.6 (or 7.9.0, as it may be).

I've been working hard to improve the job isolation techniques available in Condor. My dictionary defines the verb "to isolate" as "to be or remain alone or apart from others"; when applied to the Condor context, we'd like to isolate each job from the others. We'll define process isolation as the inability of a process running in a batch job to interfere with a process not a part of the job. Interfering with processes on Linux, loosely defined, means the sending of POSIX signals, taking control via the ptrace mechanism, or writing into the other process's memory.

Process isolation is only one aspect of job isolation. Job isolation also includes the inability to interfere with other jobs' files (file isolation) and not being able to consume others' system resources such as CPU, memory, or disk (resource isolation).

In Condor, process isolation has historically been accomplished via one of two mechanisms:

Submitting user. Jobs from Alice and Bob will be submitted as the unix users alice and bob, respectively. In this model, the jobs running on the worker node will be run as users alice and bob, respectively. The processes in the job running under user bob are protected from the processes in the job running as user alice via traditional POSIX security mechanisms.

This model makes the assumption that jobs submitted by the same user do not need isolation from each other. In other words, there shouldn't be any shared user accounts!
This model also assumes the submit host and the worker node share a common user namespace. This can be more difficult to accomplish than it sounds: if the submit host has thousands of unique users, we must make sure each functions on the worker node. If the submit host is on a remote site with a different user namespace from the worker node, this may not be easily achievable!

Per-slot users. Each "slot" (roughly corresponding to a CPU) in condor is assigned a unique unix user. The job currently running in that slot is run under the associated username.

This solves the "gotchas" noted above with the submitting user isolation model.
This is difficult to accomplish in practice if the job wants to utilize a filesystem shared between the submit and worker nodes. Filesystem security is based on two users having distinct Unix user names; in this model, there's no way to mark your files as only readable by your own jobs.
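
Both mechanisms above lean on the same kernel check: an unprivileged process may only signal processes owned by the same user. A minimal sketch of probing that check from Python follows; the helper name signal_access is made up for illustration and is not part of Condor. It relies on the POSIX convention that signal 0 performs the permission and existence checks without delivering anything.

```python
import os

def signal_access(pid):
    """Report whether the calling process may signal `pid`.

    Hypothetical helper for illustration. Signal 0 delivers nothing;
    the kernel only runs its existence and permission checks.
    """
    try:
        os.kill(pid, 0)
    except PermissionError:
        # The process exists but belongs to a user we may not signal.
        return "denied"
    except ProcessLookupError:
        # No process with this PID is visible to us.
        return "no such process"
    return "allowed"

# A process is always allowed to signal itself.
print(signal_access(os.getpid()))
```

Run by an unprivileged user against a PID owned by someone else, the same call would be expected to report "denied"; that refusal is exactly what user-based process isolation depends on.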

Notice both techniques rely on user isolation to accomplish process isolation. Condor has an oft-overlooked third mode:

Mapping remote users to nobody. In this mode, local users (where the site admin can define the meaning of "local") get mapped to the submit host usernames, but non-local users all get mapped to user nobody - the traditional unprivileged user on Linux.

Local users can access all their files, but remote users only get access to the batch resources - no shared file systems.

Unfortunately, this is not a very secure mode as, according to the manual, the nobody account "... may also be used by other Condor jobs running on the same machine, if it is a multi-processor machine"; not very handy advice in an age where your cell phone likely is a multi-processor machine!
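
The mapping decision in this third mode can be sketched in a few lines. This is only an illustration of the policy described above, not Condor's implementation: the function name, the "user@domain" form of the submitter, and the example domains are all assumptions made for the sketch.

```python
def unix_account_for(submitter, local_domain):
    """Pick the unix account a job would run under in the third mode.

    Hypothetical sketch: `submitter` is assumed to look like
    "user@domain", and `local_domain` stands in for whatever the site
    admin defines as "local".
    """
    user, _, domain = submitter.partition("@")
    if domain == local_domain:
        return user      # local users keep their own account
    return "nobody"      # every non-local user shares one account

# Made-up submitters; "example.edu" plays the role of the local site.
jobs = ["alice@example.edu", "bob@remote.example.org", "carol@other.example.org"]
for submitter in jobs:
    print(submitter, "->", unix_account_for(submitter, "example.edu"))
```

In this sketch bob and carol both land on nobody, so user separation offers no protection between their jobs; that is the weakness the manual's warning points at.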

This third mode is particularly attractive to us - we can avoid filesystem issues for our local users, but no longer have to create thousands of accounts in our LDAP database for remote users. However, since jobs from remote users all run under the same unix user account, the traditional security mechanism of user separation does not apply - we need a new technique!

Enter PID namespaces, a new separation technique introduced in Linux kernel 2.6.24. When a process is created with an additional flag (CLONE_NEWPID), the kernel assigns the child an additional process ID (PID). The child will believe itself to be PID 1 (that is, when the child calls getpid(), it returns 1), while the processes in the parent's namespace will see a different PID. The child can spawn additional processes, all stuck in the same inner namespace, and each similarly has an inner PID different from its outer one.

Processes within the namespace can only see and interfere with (send signals to, ptrace, etc.) other processes inside the namespace. By launching the new job in its own PID namespace, Condor can achieve process isolation without user isolation: the job processes are isolated from all other processes on the system.
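
To make the two-PIDs-per-process idea concrete, here is a toy model in pure Python. It is a simulation only and does not create a real namespace (on a real system the kernel does this when a process is created with the CLONE_NEWPID flag, which normally requires privileges). The class name and the starting host PID of 5000 are arbitrary choices made for the sketch.

```python
import itertools

class ToyPidNamespace:
    """Toy model of a PID namespace: each process gets an inner and an outer PID."""

    def __init__(self, host_pids):
        self._host_pids = host_pids            # counter shared with the "host"
        self._inner_pids = itertools.count(1)  # inner numbering starts at 1
        self.inner_to_outer = {}

    def spawn(self):
        """Create a process; return its (inner PID, outer PID) pair."""
        inner = next(self._inner_pids)
        outer = next(self._host_pids)
        self.inner_to_outer[inner] = outer
        return inner, outer

    def can_signal(self, inner_pid):
        """Inside the namespace, only PIDs that exist in it can be targeted."""
        return inner_pid in self.inner_to_outer

host_pids = itertools.count(5000)  # arbitrary starting point for host PIDs
job = ToyPidNamespace(host_pids)
print(job.spawn())            # first process in the job
print(job.spawn())            # a child of the job
print(job.can_signal(2))      # another process inside the job
print(job.can_signal(5000))   # a host PID has no meaning inside the job
```

The first process gets inner PID 1 regardless of what the host calls it, and a host PID such as 5000 simply does not name anything from the job's point of view.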

Perhaps the best way to visualize the impact of PID namespaces in the job is to examine the output of ps:

[bbockelm@localhost condor]$ condor_run ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bbockelm     1  0.0  0.0 114132  1236 ?        SNs  11:42   0:00 /bin/bash /home/bbockelm/.condor_run.3672
bbockelm     2  0.0  0.0 115660  1080 ?        RN   11:42   0:00 ps faux

Only two processes can be seen from within the job - the shell executing the job script and "ps" itself.

Releasing a PID-namespace-enabled Condor is an ongoing effort, tracked at https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1959. I've recently redesigned the patch to be far less intrusive on the Condor internals by switching from the glibc clone() call to the clone syscall. I am hopeful it will make it in on the 7.7.6 / 7.9.0 timescale.

From a process isolation point of view, with this patch it is now safe to run jobs as user "nobody" or to re-introduce the idea of shared "group accounts". For example, we could map all CMS users to a single "cmsuser" account without having to worry about these becoming a vector for virus infection.

However, the story of job isolation does not end with PID namespaces. Stay tuned to find out how we are tackling file and resource isolation!