The previous post went from bad to dire: most batch systems use methods that are easily defeated by the job changing its runtime environment (altering the process group, reparenting to init, or changing the environment). Even when using a reliable tracking method, killing an arbitrary set of processes is not possible.
To top it off, when process tracking or killing goes awry, we have no reliable means of detecting that our methods have failed.
There's a small, relatively unknown corner of the Linux kernel that can help us out: the proc connector. A privileged process can connect a socket to the kernel and receive a stream of messages about processes on the system. Any time one of the following system calls happens:
Because it is based on events from the kernel, not periodic polling of /proc from user space, this is a far more reliable method for tracking processes. With a little help from the kernel, the picture is already brighter!
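To make the idea concrete, here is a minimal sketch of a proc connector listener in Python. It is not the ProcPolice code; the helper names (`build_listen_request`, `parse_event`, `watch`) are made up for illustration, the numeric constants are copied from the kernel headers `linux/connector.h` and `linux/cn_proc.h`, and it assumes one event per datagram. Only fork, exec and exit events are decoded.

```python
import os
import socket
import struct

# Constants from <linux/netlink.h>, <linux/connector.h>, <linux/cn_proc.h>.
NETLINK_CONNECTOR = 11
NLMSG_DONE = 3
CN_IDX_PROC = 1
CN_VAL_PROC = 1
PROC_CN_MCAST_LISTEN = 1
PROC_EVENT_FORK = 0x00000001
PROC_EVENT_EXEC = 0x00000002
PROC_EVENT_EXIT = 0x80000000

NLMSGHDR = struct.Struct("=IHHII")    # len, type, flags, seq, pid
CN_MSG = struct.Struct("=IIIIHH")     # idx, val, seq, ack, len, flags
EVENT_HDR = struct.Struct("=IIQ")     # what, cpu, timestamp_ns
FORK_DATA = struct.Struct("=iiii")    # parent pid/tgid, child pid/tgid
EXEC_DATA = struct.Struct("=ii")      # pid, tgid
EXIT_DATA = struct.Struct("=iiII")    # pid, tgid, exit_code, exit_signal


def build_listen_request(pid):
    """Build the netlink message asking the kernel to start sending events."""
    op = struct.pack("=I", PROC_CN_MCAST_LISTEN)
    cn = CN_MSG.pack(CN_IDX_PROC, CN_VAL_PROC, 0, 0, len(op), 0)
    total = NLMSGHDR.size + len(cn) + len(op)
    return NLMSGHDR.pack(total, NLMSG_DONE, 0, 0, pid) + cn + op


def parse_event(datagram):
    """Decode one datagram; return a dict, or None for events we ignore."""
    offset = NLMSGHDR.size + CN_MSG.size
    if len(datagram) < offset + EVENT_HDR.size:
        return None
    what, _cpu, _timestamp = EVENT_HDR.unpack_from(datagram, offset)
    offset += EVENT_HDR.size
    if what == PROC_EVENT_FORK:
        _ppid, ptgid, _cpid, ctgid = FORK_DATA.unpack_from(datagram, offset)
        return {"event": "fork", "parent": ptgid, "child": ctgid}
    if what == PROC_EVENT_EXEC:
        _pid, tgid = EXEC_DATA.unpack_from(datagram, offset)
        return {"event": "exec", "pid": tgid}
    if what == PROC_EVENT_EXIT:
        _pid, tgid, code, _sig = EXIT_DATA.unpack_from(datagram, offset)
        return {"event": "exit", "pid": tgid, "exit_code": code}
    return None


def watch():
    """Print events forever. Needs root (or CAP_NET_ADMIN) on Linux."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_CONNECTOR)
    sock.bind((os.getpid(), CN_IDX_PROC))
    sock.send(build_listen_request(os.getpid()))
    while True:
        event = parse_event(sock.recv(4096))
        if event is not None:
            print(event)
```

Calling `watch()` as root on a Linux machine should print one line per fork, exec and exit on the system. The parsing is kept separate from the socket handling so that it can be exercised on canned byte strings without any privileges.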
The drawback here is that, while a message is being processed in user space, further messages to the socket are buffered in memory. When the buffer is full, the kernel drops any further messages, and the tracker may lose important events. The event stream is asynchronous: the fork or exit occurs regardless of whether you process the associated message. Unless the tracking code is particularly slow, the only case where the buffer is likely to overrun is precisely the one we care about: someone launching a fork bomb to escape the system.
If you have too many messages, the first step is to receive fewer messages. One can hand the kernel a small program in a special assembly language that pre-filters messages: a message that isn't put into the queue can't overflow it! Writing these filters is a fun academic exercise, but not useful: when a fork bomb occurs, the messages the process needs to receive are precisely the ones overflowing the buffer!
So while never failing is preferable, detecting when we have failed is acceptable: when the buffer has overflowed, the next attempt to read a message from the socket will fail with ENOBUFS to indicate that messages were dropped. Actions can be taken: for most batch systems, a nasty email to the user and sysadmin might be sufficient. If you work for the NSA, perhaps the appropriate response is to power down the worker node and send out black helicopters.
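The overflow check itself is tiny. Below is a rough sketch of how a reader loop might turn the ENOBUFS error into an explicit "we lost events" signal; the `EventsLost` exception and the `read_event` helper are hypothetical names, and what to do on failure (email, kill, helicopters) is left to the caller.

```python
import errno


class EventsLost(Exception):
    """Raised when the kernel reports that it dropped messages."""


def read_event(sock, bufsize=4096):
    """Read one datagram, converting ENOBUFS into EventsLost."""
    try:
        return sock.recv(bufsize)
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            # The receive buffer overflowed: our view of the process
            # tree can no longer be trusted.
            raise EventsLost("proc connector buffer overflowed") from exc
        raise  # Any other socket error is not ours to interpret.
```

A caller would wrap its loop in `try: ... except EventsLost:` and treat everything it knew about the job's processes as suspect from that point on.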
I've taken the approach outlined here and turned it into a small package called "ProcPolice". It consists of a simple daemon which listens to the stream of events and adds a process to an in-memory tree if the process can be traced back to a batch system job. ProcPolice will detect when a process reparents to init and, if the process was launched from the batch system and is non-root, it can log the event or kill the newly-daemonized process. In testing, it is able to stop simple fork bombs and detect more sophisticated ones.
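As an illustration of the bookkeeping involved, here is a toy version of such a tree. This is a sketch under my own assumptions, not the actual ProcPolice data structure: the `JobTree` class and its methods are hypothetical, and it guesses at reparenting by noticing that a tracked parent exited while its children are still alive.

```python
INIT_PID = 1


class JobTree:
    """Track descendants of batch job processes from fork/exit events."""

    def __init__(self, job_pids):
        # Map each tracked pid to its parent pid (None for job roots).
        self.parent = {pid: None for pid in job_pids}

    def on_fork(self, parent_pid, child_pid):
        """Track the child if its parent is tracked; report whether we did."""
        if parent_pid not in self.parent:
            return False
        self.parent[child_pid] = parent_pid
        return True

    def on_exit(self, pid):
        """Forget pid; return tracked children that are now orphaned."""
        if pid not in self.parent:
            return []
        del self.parent[pid]
        orphans = sorted(c for c, p in self.parent.items() if p == pid)
        for child in orphans:
            # Assumption: orphaned children get reparented to init.
            self.parent[child] = INIT_PID
        return orphans
```

For example, with `tree = JobTree([100])`, the calls `tree.on_fork(100, 101)` and `tree.on_fork(101, 102)` track both descendants, and `tree.on_exit(101)` then returns `[102]`: the process that has just been daemonized and is a candidate for logging or killing.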
As ProcPolice runs as a separate daemon that watches the batch system and intervenes only on daemonized processes, it can be used immediately with any batch system (Condor or PBS have been tested). ProcPolice is available in source code form from
svn://t2.unl.edu/brian/proc_police
or as a RHEL5-compatible RPM.
ProcPolice was invented with a few specific requirements in mind:
- Prevent batch system-based processes from outliving the lifetime of the batch system job without changing the runtime of the job itself.
- Do this without support in the batch system itself.
- Detect when failures occur.
- Support RHEL5 (the OS used by the LHC for the next few years).