That work started with the assumption that jobs from the same Unix user don't need to be isolated from each other. This isn't necessarily true on the grid: a single shared account per VO is still popular on the OSG. For such VOs, an attacker can gain additional credentials by reading the sandbox of each job running under the same Unix username.
To combat proxy-stealing, we use an old Unix trick called a "chroot". A sysadmin can create a complete copy of the OS inside a directory, and an appropriately-privileged process can change the root of its filesystem ("/") to that directory. In fact, this "changing of the root" is where the "chroot" terminology comes from.
For example, suppose the root of the system looks like this:
[root@localhost ~]# ls /
bin     cvmfs         hadoop-data2  home        media  opt   selinux  usr
boot    dev           hadoop-data3  lib         misc   proc  srv      var
cgroup  etc           hadoop-data4  lib64       mnt    root  sys
chroot  hadoop-data1  hadoop.log    lost+found  net    sbin  tmp
The sysadmin can create a copy of the RHEL5 operating system inside a sub-directory; at our site, this is /chroot/sl5-v3/root:
[root@localhost ~]# ls /chroot/sl5-v3/root/
bin   cvmfs  etc   lib    media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt    proc  sbin  srv      tmp  var
Note how the contents of the chroot directory are stripped down relative to the host OS: we can remove dangerous binaries, sensitive configuration, or anything else unnecessary for running a job. For example, many common Linux privilege-escalation exploits rely on the presence of a setuid binary. Such binaries (at, crontab, ping) are necessary for managing the host, but not for a running job. By eliminating the setuid binaries from the chroot, a sysadmin removes a common attack vector for processes running inside.
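As a sketch of that audit, something like the following lists any setuid binaries remaining inside a chroot (the default path matches the example above; point it at your own tree):

```shell
# List setuid binaries left inside a chroot so they can be removed or
# stripped of the setuid bit. The default path matches the example above.
CHROOT=${CHROOT:-/chroot/sl5-v3/root}

if [ -d "$CHROOT" ]; then
    # -xdev: stay on one filesystem; -perm -4000: setuid bit is set
    find "$CHROOT" -xdev -type f -perm -4000
else
    echo "no chroot at $CHROOT"
fi
# To neutralize a finding without deleting it:  chmod u-s /path/to/binary
```

Running `chmod u-s` on (or simply deleting) each hit closes off that escalation path.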
Once the directory is built, we can call chroot and isolate ourselves from the host:
[root@red-d15n6 ~]# chroot /chroot/sl5-v3/root/
bash-3.2# ls /
bin   cvmfs  etc   lib    media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt    proc  sbin  srv      tmp  var
As of version 7.7.5, Condor knows how to invoke the chroot syscall for user jobs. However, as the job sandbox is written outside the chroot, we must somehow transport it inside before starting the job. Bind mounts - discussed last time - come to our rescue. The entire process goes something like this:
- Condor, as root, forks off a new child process.
- The child uses the unshare system call to place itself in a new filesystem namespace.
- The child calls mount to bind-mount the job sandbox inside the chroot. Any other bind mounts - such as /tmp or /var/tmp - are done at this time.
- The child invokes the chroot system call, specifying the directory the sysadmin has configured.
- The child drops privileges to the target batch system user, then calls exec to start the user process.
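In shell terms, the sequence above looks roughly like the following. This is an illustrative sketch, not Condor's actual implementation: the sandbox path, user name, and job script are all hypothetical, and the real commands must run as root (so by default the script only prints them):

```shell
# Sketch of the per-job setup, using the command-line equivalents of the
# unshare/mount/chroot syscalls. DRYRUN=1 (the default) echoes each step
# instead of executing it, since the real thing requires root.
DRYRUN=${DRYRUN:-1}
run() { if [ "$DRYRUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

CHROOT=/chroot/sl5-v3/root                  # the sysadmin-built chroot
SANDBOX=/var/lib/condor/execute/dir_12345   # hypothetical job sandbox
JOBUSER=vo_user                             # hypothetical shared VO account

# 1-2. Start a child in a new filesystem namespace (unshare --mount).
# 3.   Bind-mount the sandbox (and /tmp) inside the chroot.
# 4.   chroot into the configured directory.
# 5.   Drop privileges to the batch user and exec the job.
run unshare --mount /bin/sh -c "
    mount --bind '$SANDBOX' '$CHROOT$SANDBOX' &&
    mount --bind /tmp '$CHROOT/tmp' &&
    exec chroot --userspec='$JOBUSER' '$CHROOT' /bin/sh -c 'exec ./job.sh'
"
```

Because the bind mounts happen after the unshare, they exist only in the child's private namespace and are invisible to every other process on the host.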
With this patch applied, Condor bind-mounts only the job's own sandbox forward into the new filesystem namespace, meaning the job has access to no other sandbox (as all other sandboxes live outside the private namespace). This successfully isolates jobs from each other's sandboxes, even if they run under the same Unix user!
The Condor feature is referred to as NAMED_CHROOT: sysadmins can create multiple chroot-capable directories, give each a user-friendly name (such as RHEL5, as opposed to /chroot/sl5-v3/root), and allow user jobs to request a directory by its friendly name in their submit file.
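On the worker node, the configuration looks roughly like the fragment below. The knob syntax (and how friendly names map to paths) varies across Condor versions, so treat this as illustrative and check the manual for your release:

```
# condor_config on the worker node: advertise the chroot(s) to the starter
NAMED_CHROOT = /chroot/sl5-v3/root

# In the user's submit file: request the chroot
+RequestedChroot = "/chroot/sl5-v3/root"
```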
In addition to the security benefits, we have found the NAMED_CHROOT feature allows us to run a RHEL5 job on a RHEL6 host without using virtualization; more on that in the future.
Going back to our original list of directories needing isolation - system temporary directories, job sandbox, shared filesystems, and GRAM directories - we have now isolated everything except the shared filesystems. The option here is simple, if unpleasant: mount the shared file system as read-only. This is the modus operandi for $OSG_APP at many sites, and an acceptable (but not recommended) way to run $OSG_DATA (as $OSG_DATA is optional anyway). It restricts the functionality for the user, but brings us a step closer to our goal of job isolation.
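One way to do this, assuming an NFS-exported $OSG_APP (server name and export paths are hypothetical), is an /etc/fstab entry on the worker node such as:

```
# Mount the shared application area read-only on the worker node
nfs.example.org:/export/osg/app  /osg/app  nfs  ro,nosuid,nodev  0 0
```

The `ro` option makes the mount read-only; `nosuid` additionally ignores any setuid bits on the shared filesystem, matching the setuid hygiene applied to the chroot.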
After file isolation, we have one thing left: resource isolation. Again, a topic for the future.