Monday, February 27, 2012

Improving File Isolation with chroot

In the last post, we examined a new Condor feature called MOUNT_UNDER_SCRATCH that isolates jobs from each other on the file system by giving each batch job its own private, unique copy of world-writable directories such as /tmp and /var/tmp.

That work started with the assumption that jobs from the same Unix user don't need to be isolated from each other.  This isn't necessarily true on the grid: a single shared account per VO is still popular on the OSG.  For such VOs, an attacker can gain additional credentials by reading the sandbox of every job running under the same Unix username.

To combat proxy-stealing, we use an old Unix trick called a "chroot".  A sysadmin can create a complete copy of the OS inside a directory, and an appropriately-privileged process can change the root of its filesystem ("/") to that directory.  In fact, the phrase "changing root" is where the "chroot" terminology comes from.

For example, suppose the root of the system looks like this:

[root@localhost ~]# ls /
bin     cvmfs         hadoop-data2  home        media  opt   selinux  usr
boot    dev           hadoop-data3  lib         misc   proc  srv      var
cgroup  etc           hadoop-data4  lib64       mnt    root  sys
chroot  hadoop-data1  hadoop.log    lost+found  net    sbin  tmp

The sysadmin can create a copy of the RHEL5 operating system inside a sub-directory; at our site, this is /chroot/sl5-v3/root:

[root@localhost ~]# ls /chroot/sl5-v3/root/
bin   cvmfs  etc   lib    media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt    proc  sbin  srv      tmp  var

Note how the contents of the chroot directory are stripped down relative to the host OS - we can remove dangerous binaries, sensitive configuration, or anything else unnecessary for running a job.  For example, many common Linux privilege-escalation exploits rely on the presence of a setuid binary.  Such binaries (crontab, at, ping) are necessary for managing the host, but not for a running job.  By eliminating the setuid binaries from the chroot, a sysadmin eliminates a common attack vector for processes running inside.
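Finding the setuid and setgid files in a chroot tree is easy to automate with standard find(1) and chmod(1).  The sketch below builds a throwaway directory to stand in for a real chroot such as /chroot/sl5-v3/root, then locates the offending files and strips the bits (a gentler alternative to deleting the binaries outright):

```shell
# $demo stands in for a real chroot tree such as /chroot/sl5-v3/root.
demo=$(mktemp -d)
touch "$demo/ping"
chmod 4755 "$demo/ping"   # simulate a setuid binary inside the chroot

# List every setuid or setgid regular file under the tree:
find "$demo" -xdev -type f \( -perm -4000 -o -perm -2000 \) -print

# Strip the dangerous bits from anything found:
find "$demo" -xdev -type f \( -perm -4000 -o -perm -2000 \) -exec chmod u-s,g-s {} +
```

Run against the real chroot path, the second command leaves the binaries in place but harmless to a process running inside.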

Once the directory is built, we can call chroot and isolate ourselves from the host:

[root@red-d15n6 ~]# chroot /chroot/sl5-v3/root/
bash-3.2# ls /
bin   cvmfs  etc   lib    media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt    proc  sbin  srv      tmp  var

Condor, as of 7.7.5, now knows how to invoke the chroot syscall for user jobs.  However, as the job sandbox is written outside the chroot, we must somehow transport it inside before starting the job.  Bind mounts - discussed last time - come to our rescue.  The entire process goes something like this:
  1. Condor, as root, forks off a new child process.
  2. The child uses the unshare system call to place itself in a new filesystem namespace.
  3. The child calls mount to bind-mount the job sandbox inside the chroot.  Any other bind mounts - such as /tmp or /var/tmp - are done at this time.
  4. The child invokes the chroot system call, specifying the directory the sysadmin has configured.
  5. The child drops privileges to the target batch system user, then calls exec to start the user process.
With this patch applied, Condor bind-mounts only the job's sandbox into the new filesystem namespace, meaning the job has access to no other sandbox (as all other sandboxes live outside the private namespace).  This successfully isolates jobs from each other's sandboxes, even if they run under the same Unix user!
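For the curious, the sequence above can be sketched with stock tools - util-linux unshare(1), chroot(1), and su(1) - in place of Condor's in-process system calls.  The sandbox path, the target user, and ./job.sh below are all placeholders, and the mounts only actually run if you are root and the directories exist; otherwise the sketch skips itself:

```shell
sandbox=/var/lib/condor/execute/dir_12345   # hypothetical job sandbox path
jail=/chroot/sl5-v3/root                    # the chroot built by the sysadmin

if [ "$(id -u)" -eq 0 ] && [ -d "$sandbox" ] && [ -d "$jail" ]; then
  # Step 2: new filesystem namespace; steps 3-5 run inside it.
  unshare --mount /bin/sh -c "
    mkdir -p '$jail$sandbox'                    # mount point for the sandbox
    mount --bind '$sandbox' '$jail$sandbox'     # step 3: sandbox into the jail
    mount --bind '$sandbox/tmp' '$jail/tmp'     # plus any /tmp-style bind mounts
    exec chroot '$jail' \
      su -s /bin/sh nobody -c 'exec ./job.sh'   # steps 4-5: chroot, drop, exec
  "
else
  echo "sketch only: needs root plus a real sandbox and chroot"
fi
```

Because the bind mounts live in the private namespace, they vanish when the last job process exits - no cleanup pass required.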

The Condor feature is referred to as NAMED_CHROOT: sysadmins can create multiple chroot-capable directories, give each a user-friendly name (such as RHEL5, as opposed to /chroot/sl5-v3/root), and allow user jobs to request a directory by its friendly name in their submit file.
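In configuration terms it looks roughly like the fragment below.  The knob and attribute names here are from the HTCondor manual for later releases, and the released implementation takes full paths rather than friendly names - check the documentation for your version before copying this:

```
# condor_config on the worker node: directories the starter may chroot into
NAMED_CHROOT = /chroot/sl5-v3/root

# job submit file: ask the starter to run this job inside the chroot
+RequestedChroot = "/chroot/sl5-v3/root"
```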

In addition to the security benefits, we have found the NAMED_CHROOT feature allows us to run a RHEL5 job on a RHEL6 host without using virtualization - but that's a topic for the future.

Going back to our original list of directories needing isolation - system temporary directories, job sandbox, shared filesystems, and GRAM directories - we have now isolated everything except the shared filesystems. The option here is simple, if unpleasant: mount the shared file system as read-only.  This is the modus operandi for $OSG_APP at many sites, and an acceptable (but not recommended) way to run $OSG_DATA (as $OSG_DATA is optional anyway).  It restricts the functionality for the user, but brings us a step closer to our goal of job isolation.

After file isolation, we have one thing left: resource isolation.  Again, a topic for the future.

Monday, February 20, 2012

File Isolation using bind mounts and chroots

The last post ended with a new technique for process-level isolation that unlocks our ability to safely use anonymous accounts and group accounts.

However, that's not "safe enough" for us: the jobs can still interact with each other via the file system.  This post examines the directories jobs can write into, and what can be done to remove this access.

On a typical batch system node, a user can write into the following directories:

  • System temporary directories: The Linux Filesystem Hierarchy Standard (FHS) provides at least two sticky, world-writable directories, /tmp and /var/tmp.  These directories are traditionally unmanaged (user processes can write an uncontrolled amount of data here) and a security issue (symlink attacks and information leaks), even when user separation is in place.
  • Job Sandbox: This is a directory created by the batch system as a scratch location for the job.  The contents of the directory will be cleaned out by the batch system after the job ends.  For Condor, any user proxy, executable, or job stage-in files will be copied here prior to the job starting.
  • Shared Filesystems: For a non-grid site, this is typically at least $HOME, and some other site-specific directory.  $HOME is owned by the user running the job.  On the OSG, we also have $OSG_APP for application installation (typically read-only for worker nodes) and, optionally, $OSG_DATA for data staging (writable for worker nodes).  If they exist and are writable, $OSG_APP/DATA are owned by root and marked as sticky.
  • GRAM directories: For non-Condor OSG sites, a few user-writable directories are needed to transfer the executable, proxy, and job stage-in files from the gatekeeper to the worker node.  These default to $HOME, but can be relocated to any shared filesystem directory.  For Condor-based OSG sites, this is a part of the job sandbox.
If user separation is in place and considered sufficient, filesystem isolation is taken care of for shared filesystems, GRAM directories, and the job sandbox.  The system-wide temporary directories can be protected by mixing filesystem namespaces and bind mounts.

A process can be launched in its own filesystem namespace; such a process has a private copy of the system mount table.  Any change made to the process's mount table will not be seen by the outside system, but will be shared with any child processes.

For example, if the user's home directory is not mounted on the host, the batch system could create a process in a new filesystem namespace and mount the home directory in that namespace.  The home directory will be available to the batch job, but to no other process on the filesystem.

When the last process in the filesystem namespace exits, all mounts that are unique to that namespace will be unmounted.  In our example, when the batch job exits, the kernel will unmount the home directory.

A bind mount makes a file or directory visible at another place in the filesystem - I think of it as mirroring the directory elsewhere.  We can take the job sandbox directory, create a sub-directory, and bind-mount the sub-directory over /tmp.  The process is mostly equivalent to the following shell commands (where $_CONDOR_SCRATCH_DIR is the location of the Condor job sandbox) in a filesystem namespace:

mount --bind $_CONDOR_SCRATCH_DIR/tmp /tmp

Afterward, any files a process creates in /tmp will actually be stored in $_CONDOR_SCRATCH_DIR/tmp - and cleaned up accordingly by Condor on job exit.  Any system process not in the job will not be able to see or otherwise interfere with the contents of the job's /tmp unless it can write into $_CONDOR_SCRATCH_DIR.

Condor refers to this feature as MOUNT_UNDER_SCRATCH, and it will be part of the 7.7.5 release.  The setting is an admin-specified list of directories on the worker node.  With it, the job has a private copy of each listed directory, backed by $_CONDOR_SCRATCH_DIR.  The contents - and size - of these directories are managed by Condor, just like anything else in the scratch directory.
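In the configuration file the knob is simply a list of paths; the value below is an example for a typical worker node, not a default:

```
# condor_config on the worker node: give each job a private /tmp and /var/tmp,
# backed by (and accounted against) the job's scratch directory
MOUNT_UNDER_SCRATCH = /tmp, /var/tmp
```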

If user separation is unavailable or not considered sufficient (if there are, for example, group accounts), an additional layer of isolation is needed to protect the job sandbox.  A topic for a future day!

Tuesday, February 14, 2012

Job Isolation in Condor

I'd like to share a few exciting new features under construction for Condor 7.7.6 (or 7.9.0, as it may be).

I've been working hard to improve the job isolation techniques available in Condor.  My dictionary defines the verb "to isolate" as "to be or remain alone or apart from others"; when applied to the Condor context, we'd like to isolate each job from the others.  We'll define process isolation as the inability of a process running in a batch job to interfere with a process not a part of the job.  Interfering with processes on Linux, loosely defined, means sending POSIX signals, taking control via the ptrace mechanism, or writing into another process's memory.
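The signal case is easy to demonstrate on any Linux box: processes owned by the same user may freely signal - and kill - each other, which is exactly what must be prevented between unrelated jobs sharing an account.  A minimal, unprivileged illustration:

```shell
sleep 60 &       # stand-in for "some other job" owned by the same user
victim=$!

# kill -0 performs the permission check without delivering a signal:
kill -0 "$victim" && echo "same user: allowed to signal PID $victim"

kill "$victim"   # ...and allowed to terminate it outright
```

Run the same kill -0 against a process owned by a different (non-root) user and it fails with "Operation not permitted" - the POSIX security mechanism the user-separation models below depend on.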

Process isolation is only one aspect of job isolation.  Job isolation also includes the inability to interfere with other jobs' files (file isolation) and not being able to consume others' system resources such as CPU, memory, or disk (resource isolation).

In Condor, process isolation has historically been accomplished via one of two mechanisms:

  • Submitting user.  Jobs from Alice and Bob will be submitted as the Unix users alice and bob, respectively.  In this model, the jobs on the worker node also run as users alice and bob.  The processes in the job running as user bob are protected from the processes in the job running as user alice via traditional POSIX security mechanisms.
    • This model makes the assumption that jobs submitted by the same user do not need isolation from each other.  In other words, there shouldn't be any shared user accounts!
    • This model also assumes the submit host and the worker node share a common user namespace.  This can be more difficult to accomplish than it sounds: if the submit host has thousands of unique users, we must make sure each functions on the worker node.  If the submit host is on a remote site with a different user namespace from the worker node, this may not be easily achievable!
  • Per-slot users.  Each "slot" (roughly corresponding to a CPU) in Condor is assigned a unique Unix user.  The job currently running in that slot runs under the associated username.
    • This solves the "gotchas" noted above with the submitting user isolation model.
    • This is difficult to accomplish in practice if the job wants to utilize a filesystem shared between the submit and worker nodes.  Filesystem security is based on distinct Unix user names; in this model, there's no way to mark your files as readable only by your own jobs.
Notice both techniques rely on user isolation to accomplish process isolation.  Condor has an oft-overlooked third mode:

  • Mapping remote users to nobody.  In this mode, local users (where the site admin can define the meaning of "local") are mapped to their submit-host usernames, but non-local users are all mapped to user nobody - the traditional unprivileged account on Linux.
    • Local users can access all their files, but remote users only get access to the batch resources - no shared file systems.
Unfortunately, this is not a very secure mode: according to the manual, the nobody account "... may also be used by other Condor jobs running on the same machine, if it is a multi-processor machine" - not very handy advice in an age where even your cell phone is likely a multi-processor machine!

This third mode is particularly attractive to us - we can avoid filesystem issues for our local users, but no longer have to create thousands of accounts in our LDAP database for remote users.  However, since jobs from remote users run under the same Unix user account, the traditional security mechanism of user separation does not apply - we need a new technique!

Enter PID namespaces, a new separation technique introduced in kernel 2.6.24.  By passing an additional flag when creating a new process, the kernel will assign an additional process ID (PID) to the child process.  The child will believe itself to be PID 1 (that is, when the child calls getpid(), it returns 1), while the processes in the parent's namespace will see a different PID.  The child will be able to spawn additional processes - all will be stuck in the same inner namespace - that similarly have an inner PID different from the outer one.

Processes within the namespace can only see and interfere (send signals, ptrace, etc) with other processes inside the namespace.  By launching the new job in its own PID namespace, Condor can achieve process isolation without user isolation: the job processes are isolated from all other processes on the system.
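If util-linux's unshare(1) is available and you have the privileges (root, or a kernel permitting unprivileged user namespaces), the effect is easy to reproduce by hand.  The sketch below probes for support first and skips gracefully where namespaces aren't usable:

```shell
# Probe: can we actually create a PID namespace here?
if unshare --fork --pid true 2>/dev/null; then
  # ps, with a freshly mounted /proc, sees only the inner namespace
  unshare --fork --pid --mount-proc ps faux || echo "ps inside namespace failed"
else
  echo "PID namespaces not usable here; skipping the demo"
fi
```

When the demo runs, the ps listing contains just two processes - ps itself as PID 1's child - mirroring what a Condor job sees, as shown below.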

Perhaps the best way to visualize the impact of PID namespaces in the job is to examine the output of ps:

[bbockelm@localhost condor]$ condor_run ps faux
bbockelm     1  0.0  0.0 114132  1236 ?        SNs  11:42   0:00 /bin/bash /home/bbockelm/.condor_run.3672
bbockelm     2  0.0  0.0 115660  1080 ?        RN   11:42   0:00 ps faux

Only two processes can be seen from within the job - the shell executing the job script and "ps" itself.

Releasing a PID namespaces-enabled Condor is an ongoing effort; I've recently re-designed the patch to be far less intrusive on the Condor internals by switching from the glibc clone() call to the clone syscall.  I am hopeful it will make the 7.7.6 / 7.9.0 timescale.
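(For readers on later releases: the feature eventually shipped behind a boolean knob in the starter's configuration.  The name below is taken from the current HTCondor manual - verify against your version.)

```
# condor_config on a Linux worker node: launch each job in its own PID namespace
USE_PID_NAMESPACES = true
```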

From a process isolation point of view, with this patch it is now safe to run jobs as user "nobody" or to re-introduce the idea of shared "group accounts".  For example, we could map all CMS users to a single "cmsuser" account without having to worry about the shared account becoming a vector for virus infection.

However, the story of job isolation does not end with PID namespaces.  Stay tuned to find out how we are tackling file and resource isolation!