Friday, June 24, 2011

Part II: Keeping a mindful eye on your users with ProcPolice.

In Part I of this series, we talked about the various mechanisms a batch system uses to track your job's processes, and concluded the state of the art isn't particularly impressive.  The only way to go is up; this post discusses an improved technique for process tracking in Linux.  It was motivated by this blog post from the author of upstart; if you feel inspired and would like to read some code, that post is highly recommended.

The previous post went from bad to dire: most batch systems use methods that are easily defeated by the job changing its runtime environment (altering the process group, reparenting to init, or changing the environment variables). Even with a reliable tracking method, killing an arbitrary set of processes is not possible.

To top it off, when process tracking or killing goes awry, we have no reliable way to detect that our methods have failed.

There's a small, relatively unknown corner of the Linux kernel that can help us out: the proc connector. A privileged process can connect a socket to the kernel and receive a stream of messages about processes on the system. Any time one of the following system calls happens:
  • fork/clone
  • exec
  • exit
  • setuid
  • setgid
  • setsid
for a thread or a process (all the events are documented in linux/cn_proc.h in the kernel's sources), the socket receives a message containing all the relevant event details.  By tracking only the fork and exit events, one can build a process tree in memory, starting with the batch system worker process.
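The bookkeeping this implies is simple.  Here is a minimal sketch (the class and method names are illustrative, not ProcPolice's actual code): a tracker fed fork and exit events that answers "does this pid descend from the watched worker process?"

```python
# Sketch of the in-memory process tree described above.  The event feed
# itself would come from the proc connector; this class only implements
# the fork/exit bookkeeping.

class ProcessTree:
    def __init__(self, root_pid):
        # Start tracking from the batch system worker process.
        self.tracked = {root_pid}

    def on_fork(self, parent_pid, child_pid):
        # A fork event: the child joins the tree iff its parent is tracked.
        if parent_pid in self.tracked:
            self.tracked.add(child_pid)

    def on_exit(self, pid):
        # An exit event: forget this pid.  Its descendants stay tracked
        # even after the kernel reparents them to init -- which is
        # exactly the escape trick we want to catch.
        self.tracked.discard(pid)

    def is_tracked(self, pid):
        return pid in self.tracked
```

Note that membership is decided at fork time, so a later reparenting to init (which generates no fork event) cannot remove a process from the tree.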

Because it is based on events from the kernel, not periodic polling of /proc from user space, this is a far more reliable method for tracking processes.  With a little help from the kernel, the picture is already brighter!
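For the curious, subscribing to this event stream looks roughly like the following sketch.  The constants come from linux/netlink.h, linux/connector.h, and linux/cn_proc.h; the struct layouts are simplified (no alignment padding is modeled beyond what these particular messages need), and actually opening the socket requires root.

```python
import os
import socket
import struct

# Constants from linux/netlink.h, linux/connector.h, linux/cn_proc.h.
NETLINK_CONNECTOR = 11
CN_IDX_PROC = 1
CN_VAL_PROC = 1
NLMSG_DONE = 3
PROC_CN_MCAST_LISTEN = 1

def build_listen_msg():
    """Netlink packet asking the kernel to start multicasting proc
    events: nlmsghdr + cn_msg header + a single u32 op field."""
    op = struct.pack("=I", PROC_CN_MCAST_LISTEN)
    # cn_msg: cb_id {idx, val}, seq, ack, len, flags
    cn_msg = struct.pack("=IIIIHH", CN_IDX_PROC, CN_VAL_PROC,
                         0, 0, len(op), 0) + op
    # nlmsghdr: len, type, flags, seq, pid
    nl_hdr = struct.pack("=IHHII", 16 + len(cn_msg), NLMSG_DONE,
                         0, 0, os.getpid())
    return nl_hdr + cn_msg

def subscribe():
    """Open the proc connector socket (needs root / CAP_NET_ADMIN)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_CONNECTOR)
    sock.bind((os.getpid(), CN_IDX_PROC))
    sock.send(build_listen_msg())
    return sock
```

After the LISTEN message is sent, every fork, exec, exit, setuid, setgid, and setsid on the system arrives on the socket as a cn_msg wrapping a struct proc_event.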

The drawback here is that, while a message is being processed by user space, further messages to the socket are buffered in memory.  When the buffer is full, the kernel drops any further messages, so the tracker can lose potentially important events.  The event stream is asynchronous: the fork or exit occurs regardless of whether you process the associated message.  Unless the tracking code is particularly slow, the only case likely to overrun the buffer is precisely the one we care about: someone launching a fork bomb to escape the batch system.

If you receive too many messages, the first step is to receive fewer messages.  One can hand the kernel a small program in a special assembly language (a BPF socket filter) that pre-filters messages: a message that isn't put into the queue can't overflow it!  Writing these filters is a fun academic exercise, but not useful here: when a fork bomb occurs, the messages the process needs to receive are precisely the ones overflowing the buffer!

So while never failing would be preferable, detecting when we have failed is acceptable: when the buffer is full, the next attempt to read a message from the socket returns an ENOBUFS error to indicate the buffer has overflowed.  Actions can then be taken: for most batch systems, a nasty email to the user and sysadmin might be sufficient.  If you work for the NSA, perhaps the appropriate response is to power down the worker node and send out black helicopters.
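In code, the detection is a single errno check in the read loop.  A minimal sketch, where `handle_event` and `on_overflow` are hypothetical callbacks supplied by the tracker:

```python
import errno

def read_events(sock, handle_event, on_overflow):
    """Read loop over the proc connector socket.  Dispatch each raw
    message to handle_event(); when the kernel reports that its buffer
    overflowed and events were dropped, call on_overflow() instead."""
    while True:
        try:
            data = sock.recv(4096)
        except OSError as e:
            if e.errno == errno.ENOBUFS:
                # Events were lost: the in-memory tree can no longer be
                # trusted, so escalate (log, email, black helicopters).
                on_overflow()
                continue
            raise
        if not data:  # socket closed
            return
        handle_event(data)
```

The key design point is that `on_overflow` does not try to repair the tree; once events are dropped, any process may have escaped unnoticed, so the only honest response is to report the failure.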

I've taken the approach outlined here and turned it into a small package called "ProcPolice".  It consists of a simple daemon that listens to the stream of events and adds each new process to an in-memory tree if the process can be traced back to a batch system job.  ProcPolice will detect when a process reparents to init and, if it was launched from the batch system and is non-root, can log the event or kill the newly-daemonized process.  In testing, it is able to stop simple fork bombs and detect more sophisticated ones.

As ProcPolice runs as a separate daemon that watches the batch system and intervenes only on daemonized processes, it can be used immediately with any batch system (Condor and PBS have been tested).  ProcPolice is available in source code form, or as a RHEL5-compatible RPM.

ProcPolice was invented with a few specific requirements in mind:
  1. Prevent batch system-based processes from outliving the lifetime of the batch system job without changing the runtime of the job itself.
  2. Do this without support in the batch system itself.
  3. Detect when failures occur.
  4. Support RHEL5 (the OS used by the LHC for the next few years).
It turns out the last requirement is perhaps the most stringent one; newer kernels have a specific feature for tracking and controlling arbitrary sets of processes.  This is the topic of the next part of this series.
