Friday, July 8, 2011

Part III: Bulletproof process tracking with cgroups

Finally, it's time to provide a good solution for accomplishing process tracking in a Linux batch system.
If you recall in Part I, we surveyed common methods for process tracking and ultimately concluded that batch systems used userspace mechanisms (most of which were originally designed for shell-based process control, by the way) that were unreliable, or couldn't detect when failures occur.  In Part II, the picture brightened: the kernel provided an event feed about process births and deaths, and informed us when messages were dropped.

In this post, we'll talk about a new feature called "cgroups", short for "control groups".  Cgroups are a mechanism in the Linux kernel for managing a set of processes and all their descendents.  They are managed through a filesystem-like interface (in the manner of /proc); the directory structure expresses the fact they are hierarchical, and filesystem permissions can be used to restrict the set of users allowed to manipulate them.  By default, only root is allowed to manipulate control groups: unlike the process groups, process trees, and environment cookies examined before, a process typically has no ability to change its group.  Further, unlike the proc connector API, the control group is assigned synchronously by the kernel at process creation time.  Hence, fork-bombs are not an effective way to escape from the group.

While having the tracking done by the kernel is an immense improvement, the true power of cgroups become apparent through the use of multiple subsystems.  Different cgroup subsystems may act to control scheduler policy, allocate or limit resources, or account for usage.

For example, the memory controller can be used to limit the amount of memory used by a set of processes.  This is a huge improvement over the previous memory limit technique (rlimit), where the limit was assigned per-process.  With rlimit, you could limit a single process to 1GB, but the job would just spawn N processes of 1GB each, sidestepping your limits.  In the kernel shipped with Fedora 15, 10 controllers are active by default.  For more information, you can check the documentation:
If you are a Redhat customer, I find the RHEL6 manual has the best cgroups documentation out there.

To see cgroups in action, use the systemd-cgls command found on Fedora 15.  This will print out the current hierarchy of all cgroups.  Here's what I see on my system (output truncated for display reasons):

├ condor
│ ├ 17948 /usr/sbin/condor_master -f
│ ├ 17949 condor_collector -f
│ ├ 17950 condor_negotiator -f
│ ├ 17951 condor_schedd -f
│ ├ 17952 condor_startd -f
│ ├ 17953 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 48...
│ └ 18224 condor_procd -A /var/run/condor/procd_pipe.STARTD -R 10000000 -S 60 -C 48...
├ user
│ ├ root
│ │ └ master
│ │   └ 6879 bash
│ └ bbockelm
│   ├ 1168
│   │ ├ 21426 sshd: bbockelm [priv]
│   │ ├ 21429 sshd: bbockelm@pts/3
│   │ ├ 21430 -bash
│   │ └ 21530 systemd-cgls
│   ├ 309
│   │ ├  1110 /usr/libexec/gvfsd-http --spawner :1.4 /org/gtk/gvfs/exec_spaw/0
│   │ ├  6198 gnome-terminal
│   │ ├  6202 gnome-pty-helper 
(output trimmed) 
└ system
  ├ 1 /bin/systemd --log-level info --log-target syslog-or-kmsg --system --dump...
  ├ sendmail.service
  │ ├ 8603 sendmail: accepting connections
  │ └ 8612 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
  ├ auditd.service
  │ ├ 8542 auditd
  │ ├ 8544 /sbin/audispd
  │ └ 8552 /usr/sbin/sedispatch
  ├ sshd.service
  │ └ 7572 /usr/sbin/sshd 
(output trimmed)

All of the processes in my system are in the / cgroup; all login shells are placed inside a cgroup named
; each system service (such as ssh) is located inside a cgroup named
; finally, there's a special one named
; More on

To see the cgroups for the current process, you can do the following:
[bbockelm@mydesktop ~]$ cat /proc/self/cgroup 
Note that each processes is not necessarily in one cgroup. The rules are that a process can have one cgroup per mount, there is one or more controller per mount, and a controller can only be mounted once.

Each controller has statistics accessible via proc.  For example, on Fedora 15, if I want to see how much memory all of my login shells are using, I can do the following:

[bbockelm@rcf-bockelman ~]$ cat /cgroups/memory/condor/memory.usage_in_bytes 

But what about the batch system?
I hope our readers can see the immediate utility in having a simple mechanism for unescapable process tracking.  We examined one such mechanism before (adding a secondary GID per batch job), but it has a small drawback in that the secondary GID can be used to create permanent objects (files owned by the secondary GID) which outlive the lifetime of the batch job.

But, even in Part I of the series, we concluded that a perfect process tracking mechanism is not enough: we also need to be able to kill processes when the batch job is finished!  The cgroups developer must have come to the same conclusion, as one controller is called the freezer.  The freezer cgroup simply stops any process from receiving CPU time from the kernel.  All process in the cgroups are frozen - and there is no way for a process to know it is about to freeze, as they aren't informed via signals.  Hence, a process tracker can freeze the processes, send them all SIGKILL, and unfreeze them.  All processes will end immediately; none will have the ability to hide in the /proc system or spawn new children in a race condition.

If you look at the first process tree posted, there is a cgroup called "condor".  As I presented at Condor Week 2011, condor is now integrated with cgroups.  It can be started in a cgroup the sysadmin specifies (such as /condor), and it will create a unique cgroup for each job (/cgroup/job_$CLUSTERID_$PROC_ID).  It uses whatever controllers are active on the system to try and track memory consumption, CPU time, and block I/O.  When the job ends or is killed, the freezer controller is used to clean up any processes.

As the disparate scientific clusters have become increasingly linked through the use of grids, improved process tracking has become more important.  Many sites have users from across the nation; it's no longer possible for a sysadmin to be good friends with each user.  Some have jobs with questionable quality; some have with virus-ridden laptops.

In the end, traditional process tracking in batch systems is not really ready for modern users.  Most modern batch systems no longer rely solely on the original Unix grouping mechanisms, but will fall to user malicious users.  The problem is not solvable only from user space.

Luckily, with the proc connector API (for any Linux 2.6 kernel) and cgroups (for recent Kernels), we can greatly improve the state of the art.  The folks contributing to the Linux kernel is broad, but I understand much of the contributions for cgroups has come from the OpenVZ folks: thanks guys!.

As I've been exploring this subject, I have been implementing cgroup usage in Condor: I think it's a great new feature.  They will be released with Condor 7.7.0, due in a few days.  There's no reason other batch systems can't also adopt cgroups for process tracking: I hope the spread widely in the future!


  1. I'm not sure /cgroup/job_$CLUSTERID_$PROC_ID will be unique. When flocking from multiple hosts, it is possible, though rare, that CLUSTERID and PROC can be the same.

    I would use the GlobalJobId of the job, if available.

  2. @Derek: I think you're right. You should file a bug report about that...