OSG Technology Area Rumblings: Job Isolation in Condor

I'd like to share a few exciting new features under construction for Condor 7.7.6 (or 7.9.0, as it may be).

I've been working hard to improve the job isolation techniques available in Condor. My dictionary defines the verb "to isolate" as "to be or remain alone or apart from others"; when applied to the Condor context, we'd like to isolate each job from the others. We'll define process isolation as the inability of a process running in a batch job to interfere with a process not a part of the job. Interfering with processes on Linux, loosely defined, means the sending of POSIX signals, taking control via the ptrace mechanism, or writing into the other process's memory.

Process isolation is only one aspect of job isolation. Job isolation also includes the inability to interfere with other jobs' files (file isolation) and not being able to consume others' system resources such as CPU, memory, or disk (resource isolation).

In Condor, process isolation has historically been accomplished via one of two mechanisms:

Submitting user. Jobs from Alice and Bob will be submitted as the unix users alice and bob, respectively. In this model, the jobs running on the worker node will be run as users alice and bob, respectively. The processes in the job running under user bob are protected from the processes in the job running as user alice via traditional POSIX security mechanisms.

This model makes the assumption that jobs submitted by the same user do not need isolation from each other. In other words, there shouldn't be any shared user accounts!
This model also assumes the submit host and the worker node share a common user namespace. This can be more difficult to accomplish than it sounds: if the submit host has thousands of unique users, we must make sure each of those accounts exists and works on the worker node. If the submit host is at a remote site with a different user namespace from the worker node, this may not be easily achievable!

Per-slot users. Each "slot" (roughly corresponding to a CPU) in Condor is assigned a unique unix user. The job currently running in that slot is run under the associated username.

This solves the "gotchas" noted above with the submitting user isolation model.
This is difficult in practice if the job wants to use a filesystem shared between the submit and worker nodes. Filesystem security relies on users having distinct Unix usernames; in this model, the slot account is unrelated to the submitter, so there's no way to mark your files as readable only by your own jobs.

Notice both techniques rely on user isolation to accomplish process isolation. Condor has an oft-overlooked third mode:

Mapping remote users to nobody. In this mode, local users (where the site admin can define the meaning of "local") get mapped to the submit host usernames, but non-local users all get mapped to user nobody - the traditional unprivileged user on Linux.

Local users can access all their files, but remote users only get access to the batch resources - no shared file systems.

Unfortunately, this is not a very secure mode as, according to the manual, the nobody account "... may also be used by other Condor jobs running on the same machine, if it is a multi-processor machine"; not very handy advice in an age where your cell phone likely is a multi-processor machine!
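To make the difference between the three models concrete, here is a small illustrative sketch of how the run-as account could be chosen in each one. Everything in it is hypothetical: the Job class, the run_as_user function, the mode names, the "slotNuser" naming, and the domain comparison (standing in for whatever "local" means at your site) are my own inventions for illustration, not Condor code or configuration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    owner: str       # username on the submit host
    uid_domain: str  # user namespace the submit host belongs to (assumed field)


def run_as_user(job, mode, slot, local_domain="example.edu"):
    """Pick the unix account a job runs under (illustration only, not Condor logic)."""
    if mode == "submitting-user":
        # Same account on the submit host and the worker node.
        return job.owner
    if mode == "per-slot":
        # One dedicated account per slot, whoever submitted the job.
        return f"slot{slot}user"  # hypothetical naming scheme
    if mode == "nobody-for-remote":
        # Local users keep their own account; everyone else shares "nobody".
        return job.owner if job.uid_domain == local_domain else "nobody"
    raise ValueError(f"unknown mode: {mode}")


alice = Job(owner="alice", uid_domain="example.edu")
carol = Job(owner="carol", uid_domain="remote.example.org")
print(run_as_user(alice, "nobody-for-remote", slot=1))  # local user
print(run_as_user(carol, "nobody-for-remote", slot=2))  # remote user
```

Note that in the last mode every remote job lands on the same account, which is exactly where user-based separation stops helping.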

This third mode is particularly attractive to us - we can avoid filesystem issues for our local users, but no longer have to create the thousands of accounts in our LDAP database for remote users. However, since jobs from remote users run under the same unix user account, the traditional security mechanism of user separation does not apply - we need a new technique!
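To see why a shared account gives no process isolation, here is a minimal sketch of one process signalling another that runs under the same uid. The function name is made up for this post, and the "victim" is just a sleeping Python interpreter standing in for somebody else's job. It happens to be a child of the sender, but the kernel's permission check for signals is based on matching user IDs, not on parentage.

```python
import os
import signal
import subprocess
import sys


def signal_same_user_process():
    """Start a sleeping process under our own uid, then terminate it with SIGTERM."""
    victim = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Allowed because sender and target run under the same unix user.
    os.kill(victim.pid, signal.SIGTERM)
    victim.wait()
    # A negative return code means "killed by that signal number".
    return victim.returncode


if __name__ == "__main__":
    print("victim exit status:", signal_same_user_process())
```

Two jobs both mapped to user nobody are in the same position with respect to each other.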

Enter PID namespaces, a new separation technique introduced in kernel 2.6.24. By passing an additional flag when creating a new process, the kernel will assign an additional process ID (PID) to the child process. The child will believe itself to be PID 1 (that is, when the child calls getpid(), it returns 1), while the processes in the parent's namespace will see a different PID. The child will be able to spawn additional processes - all will be stuck in the same inner namespace - that similarly have an inner PID different from the outer one.

Processes within the namespace can only see and interfere (send signals, ptrace, etc.) with other processes inside the namespace. By launching the new job in its own PID namespace, Condor can achieve process isolation without user isolation: the job processes are isolated from all other processes on the system.
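Here is a rough sketch of the inner-PID behaviour, with some loud caveats. Python has no direct wrapper for clone(), so this swaps in a different route: unshare(2) called through ctypes, followed by an ordinary fork(). That route needs a newer kernel (3.8 or later) than the clone() flag, and it is not how the Condor patch works. The function name is my own, the CLONE_NEWPID constant is copied from <linux/sched.h>, and creating a PID namespace this way requires root (CAP_SYS_ADMIN); without it the function simply returns None.

```python
import ctypes
import os

CLONE_NEWPID = 0x20000000  # value from <linux/sched.h>


def inner_pid_in_new_namespace():
    """Return the PID seen by the first process in a fresh PID namespace,
    or None if the namespace could not be created (e.g. not running as root)."""
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()

    helper = os.fork()
    if helper == 0:
        # Helper process: unshare here so the caller's own state is untouched.
        os.close(read_fd)
        status = 1
        try:
            if libc.unshare(CLONE_NEWPID) == 0:
                # The next process forked is the first one in the new namespace.
                job = os.fork()
                if job == 0:
                    os.write(write_fd, str(os.getpid()).encode())
                    os._exit(0)
                os.waitpid(job, 0)
                status = 0
        finally:
            os._exit(status)

    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported = pipe.read()  # empty if the namespace was never created
    os.waitpid(helper, 0)
    return int(reported) if reported else None


if __name__ == "__main__":
    print("PID seen inside the namespace:", inner_pid_in_new_namespace())
```

When the namespace is created successfully, the innermost process should report itself as PID 1, while the outer processes know it by an ordinary PID from the parent's namespace.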

Perhaps the best way to visualize the impact of PID namespaces in the job is to examine the output of ps:

[bbockelm@localhost condor]$ condor_run ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bbockelm     1  0.0  0.0 114132  1236 ?        SNs  11:42   0:00 /bin/bash /home/bbockelm/.condor_run.3672
bbockelm     2  0.0  0.0 115660  1080 ?        RN   11:42   0:00 ps faux

Only two processes can be seen from within the job - the shell executing the job script and "ps" itself.

Releasing a PID-namespace-enabled Condor is an ongoing effort (https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1959). I've recently redesigned the patch to be far less intrusive on the Condor internals by switching from the glibc clone() call to the clone syscall. I am hopeful it will make it in the 7.7.6 / 7.9.0 timescale.

From a process isolation point of view, with this patch, it is now safe to run jobs as user "nobody" or to re-introduce the idea of shared "group accounts". For example, we could map all CMS users to a single "cmsuser" account without having to worry about these becoming a vector for virus infection.

However, the story of job isolation does not end with PID namespaces. Stay tuned to find out how we are tackling file and resource isolation!


Tuesday, February 14, 2012
