Friday, November 11, 2011

Improving the glexec-enabled life

Pilot-based workflow management systems have dramatically transformed how we view the grid today.  Instead of queueing a job (the "payload") in a workflow onto a site on a grid, these systems send an "empty" job that starts up, then downloads and starts the payload from a central endpoint.  In CS terms, it switches from a model of "work delegation" to "resource allocation".  By allocating the resource (i.e., starting the pilot job) prior to delegating work, users no longer have to know the vagaries/failure modes of direct grid submission and don't have to pay the price of sending their payloads to a busy site!

In short, pilot jobs make the grid much better.

However, like most concepts, pilot jobs are a trade-off: they make life easier for users, but harder for security folks and sysadmins.  Pilots are sent using one certificate, but payloads are run under a different identity.  If the payload job wants to act on behalf of the user, it needs to bring the user's grid credentials to the worker node.  [Side note: this is actually an interesting assumption.  The PanDA pilot system, heavily utilized by ATLAS, does not bring credentials to the worker node.  This simplifies this problem, but opens up a different set of concerns.]  If both pilot and payload are run as the same Unix user, the payload user can easily access the credentials (including the pilot credentials), executables, and output data of other running payloads.

The program glexec is a "simple" idea to solve this problem: given a set of grid credentials, launch a process under the corresponding Unix account at the site.  For example, with credentials from the HCC VO:

[bbockelm@brian-test ~]$ whoami
[bbockelm@brian-test ~]$ GLEXEC_CLIENT_CERT=/tmp/x509up_u1221 /usr/sbin/glexec 

(You'll notice the invocation is not as simple as typing "glexec whoami"; it's not exactly designed for end-user invocation.)  To achieve the user switching, glexec has to be setuid root.  Setuid binaries must be examined under a security microscope, which has unfortunately led to slow adoption of glexec.

The idea is that pilot jobs would wrap the payload with a call to "glexec", separating the payload from the pilot and other payloads.  From there, it goes horribly wrong.  Not wrong really - but rather things get sticky.

Since the pilot and payload are both low-privileged users, the pilot doesn't have permission to clean up or kill the payload.  It must again use glexec to send signals and delete sandboxes.  The several invocations are easy to screw up (and place load on the authorization system!).  There are tricky error conditions - if authorization breaks in the middle of the job, how does the pilot clean up the payload?
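To make the "several invocations" concrete, here is a minimal sketch of how a pilot might build its glexec calls.  It only constructs the environment and argument list; nothing is executed.  The GLEXEC_CLIENT_CERT variable and the /usr/sbin/glexec path come from the example above.  The helper names, the /bin/kill and /bin/rm paths, and the assumption that glexec takes the target command as its arguments are mine, for illustration only.

```python
import os

GLEXEC = "/usr/sbin/glexec"  # path taken from the example invocation above


def glexec_call(payload_proxy, argv, base_env=None):
    """Return (env, argv) for running argv under the payload's Unix account.

    Assumption: glexec reads the payload proxy from GLEXEC_CLIENT_CERT and
    runs whatever command follows it on the command line.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GLEXEC_CLIENT_CERT"] = payload_proxy
    return env, [GLEXEC] + list(argv)


def cleanup_calls(payload_proxy, payload_pid, sandbox_dir):
    """Hypothetical teardown sequence: signal the payload, then remove its sandbox."""
    return [
        glexec_call(payload_proxy, ["/bin/kill", "-TERM", str(payload_pid)]),
        glexec_call(payload_proxy, ["/bin/rm", "-rf", sandbox_dir]),
    ]
```

Each returned pair could be handed to something like subprocess.run(argv, env=env).  Note that every entry is a separate trip through the authorization system, which is where the extra load and the mid-job failure modes come from.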

As the payload is a full-fledged Linux process, it can create other processes, daemonize, escape from the batch system, etc.  As previously discussed, the batch system - with root access - typically does a poor job tracking processes.  The pilot will be hopeless unless we provide some assistance.

Glexec imposes an integration difficulty at some sites.  There are popular cron scripts that kill processes belonging to users on a node that aren't currently running batch system jobs.  So, if the pilot maps to "cms" and the payload maps to "cmsuser", the batch system only knows about "cms", and the cronjob will kill all processes belonging to "cmsuser".  We lost quite a few jobs at some sites before we figured this out!
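The failure is easy to reproduce on paper.  The sketch below is not any site's actual cron script; it is a hypothetical reduction of the logic such scripts follow, with made-up PIDs, and it only decides which processes would be killed.

```python
def processes_to_kill(processes, batch_job_users, protected=("root",)):
    """Return PIDs whose owner has no batch job on this node.

    processes: iterable of (pid, username) pairs.
    batch_job_users: usernames the batch system reports as running jobs here.
    protected: accounts the script never touches (an assumption).
    """
    allowed = set(batch_job_users) | set(protected)
    return [pid for pid, user in processes if user not in allowed]


# Hypothetical node: the batch system started a pilot as "cms", and the
# pilot used glexec to start a payload as "cmsuser".
processes = [(100, "root"), (2001, "cms"), (2002, "cmsuser")]
doomed = processes_to_kill(processes, batch_job_users=["cms"])
print(doomed)  # [2002]
```

The payload (PID 2002) is the only thing selected, because the batch system has no idea that "cmsuser" is doing legitimate work on the node.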

Site admins manage the cluster via the batch system.  Since the payload is invisible to the batch system, we're unable to kill jobs from a user with batch system tools (condor_rm, qdel).  In fact, if we get an email from a user asking for help understanding their jobs, we can't even easily find where the job is running!  Site admins have to ssh into each worker node and examine the running jobs; a process that is simply medieval.

Finally, on the OSG, invoking the authorization system requires host certificate credentials.  This is not a problem when host certs are needed for a handful of CEs at the site, but explodes when glexec is run on each worker node.  This is a piece of unique state on the worker nodes for sites to manage, adding to the glexec headache.

We're the Government.  We're here to help.

The OSG Technology group has decided to tackle the three biggest site-admin usability issues in glexec:

  1. Batch system integration: The Condor batch system provides the ability for running jobs to update the submit node with arbitrary status.  We have developed a plugin that updates the job's ClassAd with the payload's DN whenever glexec is invoked.
  2. Process tracking: There is an existing glexec plugin to do process tracking.  However, this requires an admin to set up secondary GID ranges (an administration headache) and suffers the previously-documented process tracking issues.  We will port the ProcPolice daemon over to the glexec plugin framework.
  3. Worker node certificates: We propose to fix this via improvements to GUMS, allowing the mappings to be performed based on the presence of "Role=pilot" VOMS extension in the pilot certificate.
The plugins in (1) and (2) have been prototyped, and are available in the osg-development repository as "lcmaps-plugins-condor-update" and "lcmaps-plugins-process-tracking", respectively.  The third item is currently cooking.

The "lcmaps-plugins-condor-update" plugin is especially useful, as it's a brand-new capability as opposed to an improvement.  It advertises three attributes in the job's ClassAd:
  • glexec_x509userproxysubject: The DN of the payload user.
  • glexec_user: The Unix username for the payload.
  • glexec_time: The Unix time when glexec was invoked.
We can then use these attributes to filter and locate jobs.  For example, if a user named Ian complains his jobs are running slowly, we could locate a few with the following command:

[bbockelm@t3-sl5 ~]$ condor_q -g -const 'regexp("Ian", glexec_x509userproxysubject)' -format '%s ' ClusterId -format '%s\n' RemoteHost | head
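If you'd rather post-process job ads in a script, the same filter can be sketched in Python.  This is only an illustration: it assumes the ads have already been loaded into dictionaries by some means, and the sample DNs, cluster IDs, and hostnames below are invented.  Only the attribute names (glexec_x509userproxysubject, ClusterId, RemoteHost) come from the post.

```python
import re


def jobs_for_user(job_ads, pattern):
    """Return (ClusterId, RemoteHost) for ads whose payload DN matches pattern."""
    matcher = re.compile(pattern)
    hits = []
    for ad in job_ads:
        dn = ad.get("glexec_x509userproxysubject")
        # Pilots that have not invoked glexec yet have no payload DN.
        if dn is not None and matcher.search(dn):
            hits.append((ad["ClusterId"], ad.get("RemoteHost")))
    return hits


# Invented sample ads, for illustration only.
sample_ads = [
    {"ClusterId": 101, "RemoteHost": "node01.example.org",
     "glexec_x509userproxysubject": "/DC=org/DC=example/CN=Ian Example"},
    {"ClusterId": 102, "RemoteHost": "node02.example.org",
     "glexec_x509userproxysubject": "/DC=org/DC=example/CN=Someone Else"},
    {"ClusterId": 103, "RemoteHost": "node03.example.org"},
]
print(jobs_for_user(sample_ads, "Ian"))  # [(101, 'node01.example.org')]
```

The unanchored search roughly mirrors what the regexp() constraint does in the condor_q command; the third ad shows why the missing-attribute case needs handling.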