Friday, June 24, 2011

Part II: Keeping a mindful eye on your users with ProcPolice.

In Part I of this series, we talked about the various mechanisms a batch system uses to track your job's processes, and concluded the state of the art isn't particularly impressive.  The only way to go is up; this post discusses an improved technique for process tracking in Linux.  It was motivated by this blog post from the author of upstart.  If you feel inspired here, and would like to read some code, it is highly recommended reading.

The previous post went from bad to dire: most batch systems use methods that are easily defeated by the job changing its runtime environment (altering the process group, reparenting to init, or changing the environment). Even when using a reliable tracking method, killing an arbitrary set of processes is not possible.

To top it off - when process tracking or killing goes awry, we have no reliable means to detect when our methods fail.

There's a small, relatively unknown corner of the Linux kernel that can help us out: the proc connector. A privileged process can connect a socket to the kernel and receive a stream of messages about processes on the system. Any time one of the following system calls happens:
  • fork/clone
  • exec
  • exit
  • setuid
  • setgid
  • setsid
for a thread or a process (all the events are documented in linux/cn_proc.h in the kernel's sources), the socket receives a message containing all the relevant event details.  By tracking only the fork and exit events, one can build a process tree in memory, starting with the batch system worker process.
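
The bookkeeping described above can be sketched in a few lines of Python.  This is only an illustration, not the ProcPolice code: the JobTracker class and the ("fork" / "exit") event format are hypothetical stand-ins for messages already decoded from the proc connector socket.

```python
class JobTracker:
    """Track the descendants of one batch worker from fork/exit events.

    Hypothetical event format: handle("fork", parent_pid, child_pid)
    and handle("exit", pid).
    """

    def __init__(self, root_pid):
        self.parent_of = {root_pid: None}   # tracked pid -> parent pid
        self.orphans = set()                # tracked pids whose parent died

    def handle(self, kind, pid, child_pid=None):
        if kind == "fork":
            # Only follow forks made by processes we already track.
            if pid in self.parent_of:
                self.parent_of[child_pid] = pid
        elif kind == "exit" and pid in self.parent_of:
            # Children of an exiting process get re-parented by the kernel;
            # remember them so they can be logged or killed.
            for child, parent in list(self.parent_of.items()):
                if parent == pid:
                    self.parent_of[child] = None
                    self.orphans.add(child)
            del self.parent_of[pid]
            self.orphans.discard(pid)

    def tracked(self):
        return set(self.parent_of)


tracker = JobTracker(100)
tracker.handle("fork", 100, 101)   # job forks a child
tracker.handle("fork", 101, 102)   # child forks a grandchild
tracker.handle("fork", 555, 556)   # unrelated process, ignored
tracker.handle("exit", 101)        # child exits, grandchild is orphaned
print(sorted(tracker.tracked()), sorted(tracker.orphans))
```

Because every fork is seen as it happens, the grandchild stays in the tree even after its parent is gone, which is exactly what the polling approach cannot guarantee.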

Because it is based on events from the kernel, not periodic polling of /proc from user space, this is a far more reliable method for tracking processes.  With a little help from the kernel, the picture is already brighter!

The drawback here is that, while the message is being processed by user-space, further messages to the socket are buffered in memory.  When the buffer is full, the kernel drops any further messages: the tracker will lose possibly important events.  The event stream is asynchronous: the fork or exit occurs regardless of whether you process the associated message.  Unless the tracking code is particularly slow, the buffer is likely to overrun only in the very case we care about: someone launching a fork bomb to escape the system.

If you have too many messages, the first step is to receive fewer messages.  One can hand the kernel a small program in a special assembly language that pre-filters messages: a message that isn't put into the queue can't overflow it!  Writing these filters is a fun academic exercise, but not useful: when a fork bomb occurs, the messages the tracker needs to receive are precisely the ones overflowing the buffer!

So while never failing is preferable, detecting when we have failed is acceptable: when the buffer is full, the next attempt to read a message from the socket will fail with ENOBUFS to indicate the buffer has overflowed.  Actions can be taken: for most batch systems, a nasty email to the user and sysadmin might be sufficient.  If you work for the NSA, perhaps the appropriate response is to power down the worker node and send out black helicopters.
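
As a rough sketch of that detection step in Python: a failed read on the socket surfaces as an OSError, and the overflow case is the one whose errno is ENOBUFS (as it is spelled in errno.h and in Python's errno module).  The read_event name is hypothetical, and sock is assumed to be an already-connected proc connector socket; anything with a recv method will do for illustration.

```python
import errno


def read_event(sock, bufsize=4096):
    """Return (data, overflowed) for one read from the event socket.

    overflowed is True when the kernel reports that it dropped messages
    because the receive buffer filled up.
    """
    try:
        return sock.recv(bufsize), False
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            # Events were lost: the in-memory tree can no longer be trusted.
            return None, True
        raise
```

A caller would treat overflowed as the signal to raise an alarm, since from that point on it cannot know which forks it missed.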

I've taken the approach outlined here and turned it into a small package called "ProcPolice".  It consists of a simple daemon which listens to the stream of events, and adds the process to an in-memory tree if the process can be tracked back to a batch system job.  ProcPolice will detect when a process reparents to init and, if it is launched from the batch system and non-root, it can log the event or kill the newly-daemonized process.  In testing, it is able to stop simple fork bombs and detect more sophisticated ones.

As ProcPolice runs as a separate daemon that watches the batch system and intervenes only on daemonized processes, it can be used immediately with any batch system (Condor and PBS have been tested).  ProcPolice is available in source code form from
svn://t2.unl.edu/brian/proc_police
Or as a RHEL5-compatible RPM.

ProcPolice was invented with a few specific requirements in mind:
  1. Prevent batch system-based processes from outliving the lifetime of the batch system job without changing the runtime of the job itself.
  2. Do this without support in the batch system itself.
  3. Detect when failures occur.
  4. Support RHEL5 (the OS used by the LHC for the next few years).
It turns out the last requirement is perhaps the most stringent one; newer kernels have a specific feature for tracking and controlling arbitrary sets of processes.  This is the topic of the next part of this series.

Wednesday, June 15, 2011

Future Computing in Particle Physics Workshop

(Taking a break from working on the next post in the series; it's about half-done, expect it before I head out on vacation this week.  For now, I'll make a note about life in the field.)

I've been invited to talk at the Future Computing in Particle Physics Workshop in Edinburgh, which has the following abstract:
Recent developments in computing and software architectures have resulted in huge potential for accelerating applications used in experimental particle physics. This is an ideal time to investigate how a significant performance boost can be achieved by the effective use of many-core and GPU architectures in a distributed computing environment, as well as utilising emerging I/O and storage technologies. This workshop aims to discuss what has been done so far in the field and what potential future development areas are feasible.
https://indico.cern.ch/conferenceDisplay.py?confId=141309
It's an exciting workshop; the downside is that it started today and I'm on the wrong side of the Atlantic!  Thus, I have the pleasure of attending via videoconference.  While it doesn't truly replace attending a conference (we all swear half the value of these conferences is the discussions that occur during breaks), there are a few things I've learned:
  • No matter how early your presentation is in your timezone, show up early and ask questions on other presentations.  Besides the obvious good etiquette (if you don't plan on paying attention, decline the invitation), this allows you to test out the quality of the videoconference setup.
  • Find a friend sitting in the remote audience to IM you during the presentation.  When you're physically there, you can gauge interest levels from the audience's body language.  Are they bored?  Can they hear/see you?  Having a spy in the audience helps you get this feedback.
  • My father always says "I can hire a monkey to stand up and read off Powerpoint slides.  They are here to hear you present".  The adage is still partially true, but a larger-than-normal part of the information conveyed to the audience is going to go through these slides.  Spend some extra time on them.
Unfortunately, while the presentations and audio were excellent, the "Whisky Tasting Welcome Reception" doesn't translate well to videoconferencing.

Now, onto the subject of the workshop: the future of computing in particle physics.  I'll be talking about I/O.  Really, it all boils down to two things:
  • There is no magic bullet to make I/O faster.  From what I can tell, the limitation is the complexity of our data structures.  Improvements to the current I/O stack - or a new I/O stack - aren't likely to turn bad data structures into good ones.
  • We demand remote I/O!  Having batch system access to the wealth of data is great... but it's time to have the ability to do remote I/O also.

Monday, June 13, 2011

Part I: How your batch system watches your processes (and why it's so bad at it)

Series Preamble

Almost every cluster sysadmin has faced a case of "users gone wild"; for us, it's almost always due to users abusing the shared file system or user processes escaping the watchful eye of the batch system.  If I could prevent abuse of the shared file system while keeping it functional, I'd be a rich man.  I'm not a rich man, so I'm going to be talking about the latter issue.  This is a big topic, so I'm going to be splitting it up into a few posts:
  • Part I: How your batch system watches your processes (and why it's so bad at it).
  • Part II: Keeping a mindful eye on your users with ProcPolice.
  • Part III: Death of the fork-bomb: Ironclad process tracking in batch systems.
A few caveats up-front: I'm going to be talking about the platform I know (Linux-based OS's) and the batch systems we use (Condor, PBS, and a bit of SGE).  Apologies to the Windows/obscure-Unix-variant/LSF users out there.

So, onward and upward!

Strategy: Process Groups
Each process on the system belongs to a process group, and the process groups are further grouped into a session (as in, a login session).  Most batch systems, when starting a job, will start the job in a new session and a fresh process group.  Process groups are at their most useful when sending signals: the batch system can send a signal (such as SIGKILL to terminate processes) to a process group.  The kernel does the process tracking and appropriately signals all the processes in a group.
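
To make the signaling concrete, here is a minimal Python sketch, assuming a Linux/POSIX host.  The function name run_and_kill_group is hypothetical, and the sleeping child is just a stand-in for a real job; no batch system's actual launch code is shown here.

```python
import os
import signal
import subprocess
import sys


def run_and_kill_group():
    """Start a stand-in job in a new session, then SIGKILL its process group.

    Returns (job_is_group_leader, exit_status).
    """
    job = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,   # the child calls setsid() before exec
    )
    pgid = os.getpgid(job.pid)
    # One call; the kernel delivers the signal to every member of the group.
    os.killpg(pgid, signal.SIGKILL)
    return pgid == job.pid, job.wait()


if __name__ == "__main__":
    print(run_and_kill_group())
```

The job leads its own process group, so a single killpg call reaches it and any children that stayed in the group.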

If this worked well, it would be a short blog series.  Unfortunately, any process can start a new process group - removing it from its previous group.  To hide from the batch system, one can simply daemonize and start a new process group.  Voilà!  You've now escaped.
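
The escape takes very little code.  The following Python sketch (the function name is hypothetical; Linux/POSIX assumed) forks a child that calls setsid() and reports the process group it ends up in:

```python
import os


def escape_process_group():
    """Fork a child that starts its own session and process group.

    Returns (parent_pgid, child_pgid).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: leave the parent's process group, report the new one, exit.
        try:
            os.close(read_fd)
            os.setsid()
            os.write(write_fd, str(os.getpgrp()).encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_pgid = int(pipe.read())
    os.waitpid(pid, 0)
    return os.getpgrp(), child_pgid


if __name__ == "__main__":
    print(escape_process_group())
```

The two numbers differ, so a signal sent to the parent's process group would no longer reach the child.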

Process groups are utilized by all batch systems; despite not working well, this is the most straightforward mechanism.

Strategy: Process Trees
Another strategy is "process trees".  When a process is created, it has both a process ID and a parent ID.  By reading from the /proc filesystem (side note: Mac OS X has a far superior way of getting a programmatic snapshot of all system processes.  Jealous.), the batch system can create a list of all processes on the system and their parents, and build an in-memory tree of the parent-child relationships.  By starting with the process the batch system launched for the job, and walking through the tree, the batch system can determine all the processes associated with a job.  Creating and updating a process tree is poll-based: the operation is performed every X seconds (a typical value might be X=5).  As long as every process lives for at least X seconds, the batch system will see all processes.
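
One polling pass might look roughly like this in Python.  It is a sketch under the assumption of a Linux /proc layout, with hypothetical function names, and is not taken from any batch system's sources.

```python
import os


def parse_stat(text):
    """Return (pid, ppid) from the contents of /proc/<pid>/stat."""
    pid = int(text[:text.index("(")])
    # The command name may itself contain spaces or parentheses, so split
    # only what follows the last ')': the fields there are state, ppid, ...
    fields = text[text.rindex(")") + 1:].split()
    return pid, int(fields[1])


def snapshot(proc_root="/proc"):
    """Map every visible pid to its parent pid (one polling pass)."""
    parents = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            path = os.path.join(proc_root, name, "stat")
            with open(path, errors="replace") as stat_file:
                pid, ppid = parse_stat(stat_file.read())
        except OSError:
            continue   # the process exited while we were looking
        parents[pid] = ppid
    return parents


def descendants(parents, root):
    """Return every pid reachable from root through parent-child links."""
    children = {}
    for pid, ppid in parents.items():
        children.setdefault(ppid, []).append(pid)
    found, todo = set(), [root]
    while todo:
        for child in children.get(todo.pop(), []):
            if child not in found:
                found.add(child)
                todo.append(child)
    return found
```

A batch system would call descendants(snapshot(), job_pid) every X seconds; anything that forks and exits between two passes never appears in any snapshot.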

Well-behaved batch jobs stay within the process tree started by the batch system; this way, they get monitored by the batch system and their resource usage gets accounted.

The issue here, again, is badly behaved processes.  When a process's parent exits, the child is assigned a new parent: init, or PID 1.  To escape notice of a batch system that uses process trees, one needs to fork twice (creating a child and "grandchild"), then have the immediate child die.  The grandchild is re-parented to init, as if it were a system daemon.  Done quickly enough, the grandchild has escaped the batch system.
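
The double fork can be demonstrated with a short Python sketch (hypothetical function name, Linux/POSIX assumed).  The grandchild waits until the kernel has given it a new parent and then reports who that is; on a classic system this is PID 1, though inside some containers or under a process "subreaper" it may be a different PID.

```python
import os
import time


def double_fork(timeout=5.0):
    """Return (middle_pid, parent pid the grandchild sees once orphaned)."""
    read_fd, write_fd = os.pipe()
    middle_pid = os.fork()
    if middle_pid == 0:
        try:
            os.close(read_fd)
            middle = os.getpid()
            if os.fork() == 0:
                # Grandchild: wait for the middle process to die, then
                # report our new parent.
                deadline = time.monotonic() + timeout
                while os.getppid() == middle and time.monotonic() < deadline:
                    time.sleep(0.01)
                os.write(write_fd, str(os.getppid()).encode())
            # The middle process falls straight through and exits.
        finally:
            os._exit(0)
    os.close(write_fd)
    os.waitpid(middle_pid, 0)
    with os.fdopen(read_fd) as pipe:
        new_parent = int(pipe.read())
    return middle_pid, new_parent


if __name__ == "__main__":
    print(double_fork())
```

Once the middle process is gone, nothing in the grandchild's parent chain leads back to the original process, so a tree walk starting from the job will not find it.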

The process tree strategy is used by Condor.

Strategy: Environment Cookies
A process's Unix environment is automatically inherited by its children, and remains unchanged if the parent exits.  Condor currently takes advantage of these facts and inserts an extra environment variable into each batch system job.  If you dump the environment of your current job using "env", you might see something like this:

_CONDOR_ANCESTOR_17948=17952:1307975354:2631244213
_CONDOR_ANCESTOR_17952=18260:1307976308:2791283533
_CONDOR_ANCESTOR_18260=18263:1307976309:1204008886

Each of these is an environment variable used by Condor to track the process's ancestry.  In this case, the condor_starter's PID is 18260 and the job's PID is 18263 (the other entries are from parents of the condor_starter process, the condor_startd and condor_master).  Any sub-process started by the job will retain the _CONDOR_ANCESTOR_18260 variable by default.

When Condor polls the /proc filesystem to build a process tree, it can also read out the environment variables and use this information to build the process tree.  As before, this relies on the user being friendly: if the environment variables are changed, the process can again escape the batch system.
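
A sketch of reading those cookies in Python follows.  It is based only on the sample output shown earlier: the PID in the variable name is taken to be the ancestor and the first colon-separated field its child, as the starter/job example suggests; the meaning of the remaining two fields is not interpreted.  The function name is hypothetical, and the input is the NUL-separated blob one would read from /proc/PID/environ.

```python
PREFIX = "_CONDOR_ANCESTOR_"


def ancestor_cookies(environ_bytes):
    """Map ancestor PID -> child PID from a NUL-separated environment blob."""
    links = {}
    for entry in environ_bytes.split(b"\0"):
        name, sep, value = entry.decode(errors="replace").partition("=")
        if not sep or not name.startswith(PREFIX):
            continue
        ancestor = name[len(PREFIX):]
        child = value.split(":")[0]
        if ancestor.isdigit() and child.isdigit():
            links[int(ancestor)] = int(child)
    return links


sample = (
    b"PATH=/usr/bin\0"
    b"_CONDOR_ANCESTOR_17948=17952:1307975354:2631244213\0"
    b"_CONDOR_ANCESTOR_17952=18260:1307976308:2791283533\0"
    b"_CONDOR_ANCESTOR_18260=18263:1307976309:1204008886\0"
)
print(ancestor_cookies(sample))
print(ancestor_cookies(b"PATH=/usr/bin\0"))   # a job that scrubbed its environment
```

The second call shows the weakness: a process that starts its children with a clean environment leaves nothing for the batch system to find.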

Strategy: Supplementary Group IDs
Notice that all strategies so far involve some property of the process which is automatically inherited by its children (the process group, the process ancestry, or the Unix environment variables), but can be changed by the user's job.

A property inherited by subprocesses that cannot be changed without special privilege is the set of group IDs.  Each process has a set of group IDs associated with it (if you look at the contents of /proc/self/status, you can see the groups associated with your terminal); it requires administrator privileges to add or remove group IDs, which the batch system has but the user does not.

Condor and SGE can be assigned a range of group IDs to hand out, and assign one of the IDs to the job process they launch.  Assuming there is only one instance of the batch system on the node, any process with that group ID must have come from the batch job.  So, when it comes time to kill batch jobs or perform accounting, we can map any process back to the batch system job.
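
The mapping step can be sketched in Python as follows.  The function names, the GID value 50003, and the "slot3" label are all made up for illustration; the only real detail relied on is the "Groups:" line that Linux prints in /proc/PID/status.

```python
def supplementary_gids(status_text):
    """Return the set of GIDs on the 'Groups:' line of a /proc status file."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return {int(gid) for gid in line.split()[1:]}
    return set()


def slot_for_process(status_text, gid_to_slot):
    """Return the batch slot whose tracking GID this process carries, if any."""
    for gid in sorted(supplementary_gids(status_text)):
        if gid in gid_to_slot:
            return gid_to_slot[gid]
    return None


# Hypothetical status excerpt for a job process tagged with GID 50003.
status = "Name:\tsleep\nUid:\t1000\t1000\t1000\t1000\nGroups:\t100 1000 50003 \n"
print(slot_for_process(status, {50003: "slot3"}))
```

Unlike the environment cookie, the process has no unprivileged way to drop the tracking GID, so the lookup works no matter how the job rearranges itself.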

While the user process cannot get rid of the ID, this setup is still possible to defeat (discussed below), and has a few drawbacks.  The user process now has a new GID, and can create files using that GID; I have no clue how this might be useful, but it's a sign of misusing the GID concept.  Anything that caches the user-to-groups mapping may get the wrong set of GIDs (as unique per-process GIDs are rare, these caches may have broken assumptions).  Finally, it lays extra work on the sysadmin, who now must maintain a range of unused GIDs; there must be sufficient GIDs to provide one per batch slot.  Locally, we've run into the fact that the number of GIDs needed increases with the number of cores per node: what was a good setting last year is no longer sufficient.

Note that, with Condor, you can take this one step further and assign a unique user ID per batch slot, and run the job under that UID as opposed to the submitter's UID.  This is a nightmare in terms of NFS-based shared file systems, but the approach at least works on both Unix and Windows.

How to defeat your batch system (inadvertently, right?)
Despite the drawbacks, the supplementary GID mechanism seems pretty foolproof: the user can no longer launch processes that can't be tracked back to a batch slot.  However, this isn't sufficient to stop malicious users.

In order to kill all processes based on some attribute of the process (besides the process group), one must iterate through the contents of the /proc directory, read and parse the process's status file, and send a kill signal as appropriate.  Ultimately, all batch systems currently do some variation of this; if you want a simple source code example, go look up the sources of the venerable 'killall' utility.

The approach described above does have a fatal flaw: it is not atomic.  Between looking at the contents of /proc, and opening /proc/PID/status, a process could have already forked another child and exited.  Processes may have been spawned between the time when the directory iteration begins and ends, meaning they might never be seen.
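
A stripped-down version of such a scan, sketched in Python with hypothetical names, makes the gaps visible.  The listing and the status reads are separate steps, so a process can vanish in between, and anything created after the listing is simply never examined.

```python
import os


def proc_pids(proc_root="/proc"):
    """List the PIDs visible right now (already stale when it returns)."""
    return [int(name) for name in os.listdir(proc_root) if name.isdigit()]


def proc_status(pid, proc_root="/proc"):
    """Read /proc/<pid>/status; raises if the process has already exited."""
    path = os.path.join(proc_root, str(pid), "status")
    with open(path, errors="replace") as status_file:
        return status_file.read()


def scan(wanted, list_pids=proc_pids, read_status=proc_status):
    """Return (matched, vanished) for one non-atomic pass over the processes.

    wanted is a predicate on the status text.  PIDs in vanished exited
    between the listing and the read; PIDs created after the listing are
    not seen at all.
    """
    matched, vanished = [], []
    for pid in list_pids():
        try:
            status = read_status(pid)
        except (FileNotFoundError, ProcessLookupError):
            vanished.append(pid)
            continue
        if wanted(status):
            matched.append(pid)
    return matched, vanished
```

A killer built on this would send signals to everything in matched, but by then each of those processes has had time to fork again.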

Hence, a process may spawn more children in the time it takes the batch system to iterate through /proc and kill it; in fact, if the batch system is unlucky, the processes may do this fast enough that the batch system never detects they exist in the first place!  In the latter case, regardless of the tracking mechanism, the process may escape the batch system.

Worse, because these short-lived processes can be invisible to the batch system, the batch system may not detect it's being fooled; if the batch system could reliably detect the attack, it might be able to send an alert or turn off the worker node.

Ultimately, the batch system is defeated because it is trying to do process control from user-space.  We lack three things:
  1. A way to reliably track processes without changing the semantics of the job's runtime environment.
  2. Atomic operations for determining and signaling a set of processes.
  3. A way to detect when (1) or (2) has failed.
Luckily, with a little help from the Linux kernel, we can overcome all three of the above issues.  Item (2) takes a fairly modern kernel (2.6.24 or later), but items (1) and (3) can be accomplished with 2.6.0 or later.

As long as we have the ability to detect attacks as in (3), we can limp along until everyone gets onto a modern kernel: this is the topic of the next post.  Stay tuned.

Monday, June 6, 2011

Introductions

This is the (humble) beginning of the mostly-official blog of the OSG Technology Area.  As such, I feel it's only appropriate to start with the "who" and "what" we are.  Unfortunately, that makes the opening post of this blog rather dry... it'll pick up, I swear.

First, the "what":  The OSG Technology Area provides the OSG with a mechanism for long-term technology planning.  We do this through two sub-groups:

  • Blueprint: Recording the conceptual principles of the OSG and focusing on the long-term evolution of the OSG.  The Blueprint group tries to meet approximately quarterly and, under the direction of the OSG Technical Director, updates the "Blueprint Document" to reflect our understandings of the basic principles, definitions, and the broad outlines of how the pieces fit together.
  • Investigations: We're all surrounded by a continuing onslaught of technologies.  Some are fantastic.  Some are not-so-great.  Some are great, but not what we really needed.  In order to manage the way forward - while keeping to the OSG principles - this group does investigations to understand the concepts, functionality, and impact of external technologies.  The point is to identify items that are potentially disruptive in the medium-term of 12-24 months.
Now, the "who":
  • John Hover: Leader of the Grid Group at Brookhaven National Lab.  John heads up the Blueprint activity.
  • Ashu Guru: Applications specialist at the Holland Computing Center (HCC), based at the University of Nebraska-Lincoln.  Ashu works for Technology Investigations; currently, he's focusing on how virtual machines may be mixed with "traditional" batch systems.  This will be the subject of another post.
  • Brian Bockelman (me!):  Yet another HCC staff member.  I've been working heavily in the Technology Area this year; I spent lots of time participating in the rewrite of the OSG Blueprint, and am leading the Technology Investigations group.
I'm not going to lie: I have no blogging experience.  However, I am willing to plunge in head-first; please excuse me if I fumble around with the technology a bit (don't worry - I feel much more at home with distributed high throughput computing than social media).  I've got big plans for this blog: I'm hoping to rope the other members of the Technology Area into doing guest posts, pushing for a vibrant OSG-related blogging community, and trying to have this be one of the better outposts for distributed high throughput computing on the internet.

But before all that - I hope you enjoyed the introductions.