Monday, February 20, 2012

File Isolation using bind mounts and chroots

The last post ended with a new technique for process-level isolation that unlocks our ability to safely use anonymous accounts and group accounts.

However, that's not "safe enough" for us: the jobs can still interact with each other via the file system.  This post examines the directories jobs can write into, and what can be done to remove that access.

On a typical batch system node, a user can write into the following directories:

  • System temporary directories: The Linux Filesystem Hierarchy Standard (FHS) provides at least two sticky, world-writable directories, /tmp and /var/tmp.  These directories are traditionally unmanaged (user processes can write an uncontrolled amount of data here) and a security issue (symlink attacks and information leaks), even when user separation is in place.
  • Job Sandbox: This is a directory created by the batch system as a scratch location for the job.  The contents of the directory will be cleaned out by the batch system after the job ends.  For Condor, any user proxy, executable, or job stage-in files will be copied here prior to the job starting.
  • Shared Filesystems: For a non-grid site, this is typically at least $HOME, and some other site-specific directory.  $HOME is owned by the user running the job.  On the OSG, we also have $OSG_APP for application installation (typically read-only for worker nodes) and, optionally, $OSG_DATA for data staging (writable for worker nodes).  If they exist and are writable, $OSG_APP/DATA are owned by root and marked as sticky.
  • GRAM directories: For non-Condor OSG sites, a few user-writable directories are needed to transfer the executable, proxy, and job stage-in files from the gatekeeper to the worker node.  These default to $HOME, but can be relocated to any shared filesystem directory.  For Condor-based OSG sites, this is a part of the job sandbox.
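As a quick sanity check, the permissions on the system temporary directories can be inspected from the shell (a sketch, assuming GNU coreutils `stat` on a Linux host):

```shell
# The sticky bit (the leading "1" in 1777, shown as "t" by ls) lets any user
# create files in /tmp and /var/tmp, but only delete files they own.
stat -c '%a %U %n' /tmp /var/tmp
ls -ld /tmp
```

Even with the sticky bit in place, nothing limits how much data a process dumps here, and predictable file names still invite symlink attacks between users.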
If user separation is in place and considered sufficient, filesystem isolation is taken care of for shared filesystems, GRAM directories, and the job sandbox.  The systemwide temporary directories can be protected by mixing filesystem namespaces and bind mounts.

A process can be launched in its own filesystem namespace; such a process will have a copy of the system mount table.  Any change made to the process's mount table will not be seen by the rest of the system, but will be inherited by any child processes.

For example, if the user's home directory is not mounted on the host, the batch system could create a process in a new filesystem namespace and mount the home directory in that namespace.  The home directory will be available to the batch job, but not to any other process on the system.

When the last process in the filesystem namespace exits, all mounts that are unique to that namespace will be unmounted.  In our example, when the batch job exits, the kernel will unmount the home directory.
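This is easy to experiment with by hand (a sketch, assuming a util-linux `unshare` recent enough to support `-r`, which maps the caller to root inside a new user namespace so no real root privileges are needed; the directory names are illustrative):

```shell
# Create a scratch area and put a file in it.
mkdir -p /tmp/ns-demo/tmp
touch /tmp/ns-demo/tmp/only-visible-inside

# Enter a new mount (and user) namespace, bind-mount the scratch area over
# /tmp, and list it.  The mount exists only inside this namespace, and the
# kernel tears it down when the last process in the namespace exits.
unshare -r -m sh -c '
    mount --bind /tmp/ns-demo/tmp /tmp
    ls /tmp
' || echo "user namespaces unavailable on this kernel"

# Back outside the namespace, /tmp is untouched; the file is still only
# reachable via its real path.
ls /tmp/ns-demo/tmp
```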

A bind mount makes a file or directory visible at another place in the filesystem - I think of it as mirroring the directory elsewhere.  We can take the job sandbox directory, create a sub-directory, and bind-mount the sub-directory over /tmp.  The process is mostly equivalent to the following shell command (where $_CONDOR_SCRATCH_DIR is the location of the Condor job sandbox) in a filesystem namespace:

mount --bind $_CONDOR_SCRATCH_DIR/tmp /tmp

Afterward, any files a process creates in /tmp will actually be stored in $_CONDOR_SCRATCH_DIR/tmp - and cleaned up accordingly by Condor on job exit.  Any system process not in the job will not be able to see or otherwise interfere with the contents of the job's /tmp unless it can write into $_CONDOR_SCRATCH_DIR.
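Putting the pieces together, the per-job setup amounts to something like the following (a sketch; the directory names are illustrative, and `unshare -r -m` stands in for the batch system creating the namespace as root):

```shell
# Stand-in for $_CONDOR_SCRATCH_DIR, kept outside /tmp so the bind mounts
# below do not shadow their own source directories.
SCRATCH="$PWD/sandbox-demo"
mkdir -p "$SCRATCH/tmp" "$SCRATCH/var/tmp"

unshare -r -m sh -c "
    mount --bind '$SCRATCH/tmp' /tmp
    mount --bind '$SCRATCH/var/tmp' /var/tmp
    # The 'job' now writes to /tmp, which really lands in the sandbox:
    touch /tmp/job-output
" || echo "user namespaces unavailable on this kernel"

# Anything the job wrote to /tmp shows up in the sandbox, where the batch
# system can account for it and clean it up after the job exits.
ls "$SCRATCH/tmp"
```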

Condor refers to this feature as MOUNT_UNDER_SCRATCH; it will be part of the 7.7.5 release.  The setting is an admin-specified list of directories on the worker node.  With it, the job will have a private copy of each listed directory, backed by $_CONDOR_SCRATCH_DIR.  The contents - and size - of these directories will be managed by Condor, just like anything else in the scratch directory.
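In configuration terms, this looks something like the following (a sketch of the worker-node condor_config; the exact syntax and defaults should be checked against the 7.7.5 documentation):

```
# condor_config on the execute node (Condor 7.7.5 or later):
# each listed directory is replaced, per job, by a bind mount backed
# by a sub-directory of the job's scratch directory.
MOUNT_UNDER_SCRATCH = /tmp,/var/tmp
```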

If user separation is unavailable or not considered sufficient (if there are, for example, group accounts), an additional layer of isolation is needed to protect the job sandbox.  A topic for a future day!
