Saturday, March 10, 2012

Resource Isolation in Condor using cgroups

This is the last in my series on job isolation techniques.  The series has spanned several postings over the last month, so it may help to recap:

  • Part I covered process isolation: preventing processes in one job from interacting with other jobs.  This has been achievable through POSIX mechanisms for a while, but the new PID namespace mechanism provides improved isolation for jobs running as the same user.
  • Part II and Part III discussed file isolation using bind mounts and chroots.  Condor uses bind mounts to remove access to "problematic" directories such as /tmp.  While more complex to set up, chroots allow jobs to run in an environment completely separate from the host and further isolate the job sandbox.
This post will cover resource isolation: preventing jobs from consuming system resources promised to another job.

Condor has always had some crude form of resource isolation.  For example, the worker node could be configured to detect when the processes in a job have consumed more CPU time than walltime (a rough indication that more than one core is being used), or when the sum of each process's virtual memory size exceeds the memory requested for the job.  When Condor detects that too many resources are being consumed, it can take an action such as suspending or killing the job.
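To make the traditional approach concrete, below is a minimal sketch - not Condor's actual code - of the polling idea: walk a job's process tree through /proc and sum each process's virtual memory size.  The JOB_ROOT_PID constant is a hypothetical stand-in for the job's top-level process.

```python
# Minimal sketch of polling-based monitoring (not Condor's implementation).
# Walks the process tree rooted at a hypothetical JOB_ROOT_PID and sums the
# VmSize field from /proc/<pid>/status for every process in the tree.
import os

JOB_ROOT_PID = 12345  # hypothetical PID of the job's top-level process

def children_of(pid):
    """Return the direct children of `pid` by scanning /proc/*/status."""
    kids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
            if int(fields.get("PPid", "0")) == pid:
                kids.append(int(entry))
        except (OSError, ValueError):
            pass  # the process exited while we were reading
    return kids

def vm_size_kb(pid):
    """Virtual memory size (VmSize) of a single process, in kB."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmSize:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return 0

def job_image_size_kb(root_pid):
    """Sum VmSize over the whole process tree rooted at root_pid."""
    total, stack = 0, [root_pid]
    while stack:
        pid = stack.pop()
        total += vm_size_kb(pid)
        stack.extend(children_of(pid))
    return total

print(job_image_size_kb(JOB_ROOT_PID), "kB of virtual memory in the job tree")
```

Anything the job does between two invocations of such a poll - a short-lived memory spike, a process that forks and exits - is simply never seen, which is the first limitation listed below.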

This traditional approach is relatively unsatisfactory for a few reasons:
  • Condor periodically polls to view resource consumption.  Any activity between polls is unmonitored.
  • The metrics Condor traditionally monitors are limited to memory and CPU, and the memory metrics are of poor quality for complex jobs.  On a modern Linux box, the sum of each process's virtual memory size has little correlation with the RAM actually used and is not particularly meaningful.
  • We can do little with the system besides detect when resource limits have been violated and kill the job.
    • We cannot, for example, simply instruct the kernel to reduce the job's memory or CPU usage.
    • Accordingly, users must ask for their peak resource usage, which may be well above average usage, decreasing overall throughput.  If a job needs 2GB on average but 4GB for a single second, the user will ask for 4GB; the other 2GB will go unused.
In Linux, the oldest form of resource isolation is processor affinity, or CPU pinning: a job can be locked to a specific CPU, and all its processes will inherit the affinity.  Because two jobs are locked to separate CPUs, they can never consume each other's CPU resources.  CPU pinning is unsatisfactory for reasons similar to the memory limits above: jobs can't utilize otherwise-idle CPUs, decreasing potential system throughput.  The granularity is also poor: you can't evenly fairshare 25 jobs on a machine with 24 cores, as each job must be locked to at least one core.  However, it's a step forward - you don't need to kill jobs for using too much CPU - and it has been present in Condor since 7.3.
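For illustration, here is a minimal sketch of CPU pinning using the affinity system call, assuming a Linux host and a reasonably recent Python 3; the core set is a hypothetical slot assignment, not Condor's configuration.

```python
# Minimal sketch of CPU pinning (not Condor's implementation): bind the
# current process to a fixed core; every child forked afterwards inherits
# the affinity mask, so the whole job stays on its assigned core(s).
import os
import subprocess

ASSIGNED_CORES = {0}  # hypothetical: the core(s) handed to this job's slot

os.sched_setaffinity(0, ASSIGNED_CORES)   # 0 means "the calling process"
print("pinned to cores:", os.sched_getaffinity(0))

# A child process launched from here is confined to the same cores,
# even if every other core on the machine sits idle.
subprocess.run(["grep", "Cpus_allowed_list", "/proc/self/status"])
```

The inherited mask is exactly what makes this coarse: the job can never borrow an idle core, and the 25th job on a 24-core machine has to share a whole core with someone else.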

Newer Linux kernels support cgroups, which are structures for managing groups of processes, along with controllers for managing resources in each cgroup.  Cgroup support for measuring resource usage was added in Condor 7.7.0.  When enabled, Condor places each job into a dedicated cgroup for the block-I/O, memory, CPU, and "freezer" controllers.  We have implemented two new limiting mechanisms based on the memory and CPU controllers.
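As a rough illustration of the mechanics (assuming a cgroup v1 layout mounted under /sys/fs/cgroup/<controller>, root privileges, and a hypothetical condor/job_123 cgroup name rather than Condor's actual naming scheme), placing a job into a per-controller cgroup amounts to creating a directory and writing the job's PID into its tasks file:

```python
# Minimal sketch, assuming cgroup v1 hierarchies under /sys/fs/cgroup and
# root privileges; the cgroup name and PID are hypothetical placeholders.
import os

CONTROLLERS = ["blkio", "memory", "cpu", "freezer"]
JOB_CGROUP = "condor/job_123"   # hypothetical per-job cgroup name
JOB_PID = 12345                 # hypothetical PID of the job's first process

for ctrl in CONTROLLERS:
    path = os.path.join("/sys/fs/cgroup", ctrl, JOB_CGROUP)
    os.makedirs(path, exist_ok=True)
    # Writing a PID into "tasks" moves that process into the cgroup;
    # processes it forks afterwards start out in the same cgroup.
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(JOB_PID))
```

Once the job lives in these cgroups, the kernel does the bookkeeping itself: usage is accounted continuously rather than sampled at poll time.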

The CPU controller provides a mechanism for fairsharing between different cgroups.  CPU shares are assigned to jobs based on the "slot weight" (by default, equal to the number of cores the job requested).  Thus, a job asking for 2 cores will get an average of 2 cores on a fully loaded system.  If there's an idle CPU, it could utilize more than 2 cores; however, it will never get less than what it requested for a significant amount of time.  CPU fairsharing provides much finer granularity than pinning, easily allowing the jobs-to-cores ratio to be non-integer.
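Under the hood this comes down to the cpu.shares knob of the v1 CPU controller.  A minimal sketch, with a hypothetical cgroup path and an arbitrary shares-per-core scale (only the ratios between jobs matter):

```python
# Minimal sketch of CPU fairsharing via cpu.shares (cgroup v1); the path
# and scale are hypothetical.  A job that requested 2 cores gets twice the
# weight of a 1-core job under contention, but may still use idle cores.
SHARES_PER_CORE = 100       # arbitrary scale; only relative values matter
requested_cores = 2         # the job's "slot weight"

cgroup = "/sys/fs/cgroup/cpu/condor/job_123"   # hypothetical per-job cgroup
with open(cgroup + "/cpu.shares", "w") as f:
    f.write(str(SHARES_PER_CORE * requested_cores))
```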

The memory controller provides two kinds of limits: soft and hard.  When soft limits are in place, the job can use an arbitrary amount of RAM until the host runs out of memory (and starts to swap); when this happens, only jobs over their limit are swapped out.  With hard limits, the job immediately starts swapping once it hits its RAM limit, regardless of the amount of free memory.  Both soft and hard limits default to the amount of memory requested for the job.
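In cgroup v1 terms these correspond to the memory.soft_limit_in_bytes and memory.limit_in_bytes files.  A minimal sketch, again with a hypothetical cgroup path and a hypothetical 2GB request:

```python
# Minimal sketch of soft and hard memory limits (cgroup v1); the cgroup
# path and the 2GB request are hypothetical placeholders.
requested_mb = 2048
limit_bytes = requested_mb * 1024 * 1024

cgroup = "/sys/fs/cgroup/memory/condor/job_123"   # hypothetical per-job cgroup

# Soft limit: reclaimed/swapped only when the whole host is under memory
# pressure, and then preferentially from cgroups over their limit.
with open(cgroup + "/memory.soft_limit_in_bytes", "w") as f:
    f.write(str(limit_bytes))

# Hard limit: the job starts swapping as soon as its RAM usage hits this
# value, regardless of how much free memory the host has.
with open(cgroup + "/memory.limit_in_bytes", "w") as f:
    f.write(str(limit_bytes))
```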

Both methods also have disadvantages.  Soft limits can cause "well-behaved" processes to wait while the OS frees up RAM from "badly behaving" processes.  Hard limits can cause large amounts of swapping (for example, if there's a memory leak), decreasing the entire node's disk performance and thus adversely affecting other jobs.  In fact, it may be a better use of resources to preempt a heavily swapping job and reschedule it on another node than to let it continue running.  There is further room for improvement here in the future.

Regardless, cgroups and controllers provide a solid improvement in resource isolation for Condor, and finish up our series on job isolation.  Thanks for reading!