Monday, February 27, 2012

Improving File Isolation with chroot

In the last post, we examined a new Condor feature called MOUNT_UNDER_SCRATCH that isolates jobs from each other on the file system by making world-writable directories (such as /tmp and /var/tmp) unique and private to each batch job.

That work started from the assumption that jobs belonging to the same Unix user don't need to be isolated from each other.  This isn't necessarily true on the grid: a single shared account per VO is still popular on the OSG.  For such VOs, an attacker can gain additional credentials by reading the sandbox of every job running under the same Unix username.

To combat proxy-stealing, we use an old Unix trick called a "chroot".  A sysadmin creates a complete copy of the OS inside a directory, and an appropriately-privileged process can then change the root of its filesystem ("/") to that directory.  In fact, the phrase "change root" is where the "chroot" terminology comes from.

For example, suppose the root of the system looks like this:

[root@localhost ~]# ls /
bin     cvmfs         hadoop-data2  home        media  opt   selinux  usr
boot    dev           hadoop-data3  lib         misc   proc  srv      var
cgroup  etc           hadoop-data4  lib64       mnt    root  sys
chroot  hadoop-data1  hadoop.log    lost+found  net    sbin  tmp

The sysadmin can create a copy of the RHEL5 operating system inside a sub-directory; at our site, this is /chroot/sl5-v3/root:

[root@localhost ~]# ls /chroot/sl5-v3/root/
bin   cvmfs  etc   lib    media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt    proc  sbin  srv      tmp  var

Note how the contents of the chroot directory are stripped down relative to the full OS - we can remove dangerous binaries, sensitive configuration, or anything else unnecessary for running a job.  For example, many common Linux privilege-escalation exploits rely on the presence of a setuid binary.  Such binaries (at, crontab, ping) are necessary for managing the host, but not for a running job.  By eliminating the setuid binaries from the chroot, a sysadmin removes a common attack vector for the processes running inside.
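A quick way to audit the result is with find(1), which can list any setuid binaries remaining inside the chroot (using the directory from above; ideally, this prints nothing):

```shell
# List any remaining setuid executables inside the chroot.
# -xdev keeps find from wandering into bind mounts; -perm -4000 matches the setuid bit.
find /chroot/sl5-v3/root -xdev -perm -4000 -type f
```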

Once the directory is built, we can call chroot and isolate ourselves from the host:

[root@red-d15n6 ~]# chroot /chroot/sl5-v3/root/
bash-3.2# ls /
bin   cvmfs  etc   lib   media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt    proc  sbin  srv      tmp  var

As of 7.7.5, Condor knows how to invoke the chroot syscall for user jobs.  However, as the job sandbox is written outside the chroot, we must somehow transport it inside before starting the job.  Bind mounts - discussed last time - come to our rescue.  The entire process goes something like this:
  1. Condor, as root, forks off a new child process.
  2. The child uses the unshare system call to place itself in a new filesystem namespace.
  3. The child calls mount to bind-mount the job sandbox inside the chroot.  Any other bind mounts - such as /tmp or /var/tmp - are done at this time.
  4. The child will invoke the chroot system call specifying the directory the sysadmin has configured.
  5. The child drops privileges to the target batch system user, then calls exec to start the user process.
With this patch applied, Condor bind-mounts only the job's own sandbox into the private filesystem namespace, meaning the job has access to no other sandbox (as all other sandboxes live outside that namespace).  This successfully isolates jobs from each other's sandboxes, even when they run as the same Unix user!
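The sequence above can be sketched from a root shell with util-linux's unshare(1) (the sandbox path and user name here are illustrative, and Condor performs the equivalent syscalls itself rather than shelling out):

```shell
SANDBOX=/var/lib/condor/execute/dir_12345   # hypothetical sandbox path
ROOT=/chroot/sl5-v3/root

# Steps 2-5: new mount namespace, bind mounts, chroot, drop privileges, exec.
unshare --mount /bin/sh -c "
  mkdir -p $ROOT$SANDBOX                 &&
  mount --bind $SANDBOX $ROOT$SANDBOX    &&  # the sandbox appears inside the chroot
  mount --bind /tmp     $ROOT/tmp        &&  # any other per-job bind mounts go here
  exec chroot $ROOT su - batchuser -c './job.sh'
"
```

Because the bind mounts happen inside a fresh mount namespace, they are invisible to every other process on the host - exactly the property that keeps one job's sandbox out of another job's view.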

The Condor feature is referred to as NAMED_CHROOT: sysadmins can create multiple chroot-capable directories, give each a user-friendly name (such as RHEL5, as opposed to /chroot/sl5-v3/root), and let user jobs request a directory by its friendly name in their submit file.
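In configuration, this looks roughly like the following - the exact knob and attribute syntax shown here is an assumption on my part, so check the Condor manual for your version:

```
# condor_config on the execute node (illustrative syntax):
# map the friendly name RHEL5 to the chroot directory built above
NAMED_CHROOT = RHEL5=/chroot/sl5-v3/root

# In the user's submit file, request the chroot by its friendly name:
+RequestedChroot = "RHEL5"
```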

In addition to the security benefits, we have found that the NAMED_CHROOT feature lets us run a RHEL5 job on a RHEL6 host without virtualization - something for a future post.

Going back to our original list of directories needing isolation - system temporary directories, job sandbox, shared filesystems, and GRAM directories - we have now isolated everything except the shared filesystems.  The option here is simple, if unpleasant: mount the shared file system read-only.  This is the modus operandi for $OSG_APP at many sites, and an acceptable (but not recommended) way to run $OSG_DATA (which is optional anyway).  It restricts functionality for the user, but brings us a step closer to our goal of job isolation.
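For an NFS-mounted shared area, this is a one-line change on the worker node; the server name and mount point below are examples, not real site paths:

```
# /etc/fstab - mount the shared application area read-only:
fileserver:/export/osg_app  /opt/osg/app  nfs  ro,nosuid,nodev  0 0
```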

After file isolation, we have one thing left: resource isolation.  Again, a topic for the future.
