Thursday, December 29, 2011

What's the hold-up?

Do you have the following diagram memorized?

If your site runs Condor, you probably should.  It shows the states of the condor_startd, the activities within the state, and the transitions between them.  If you want to have jobs reliably pre-empted (or is that killed?  Or vacated?) from the worker node for something like memory usage, a clear understanding is required.

However, the 30 state transitions might be a bit much for some site admins who just want to kill jobs that go over a memory limit.  In such a case, admins can utilize the SYSTEM_PERIODIC_REMOVE or the SYSTEM_PERIODIC_HOLD configuration parameters on the condor_schedd to respectively remove or hold jobs.

These expressions periodically evaluate the schedd's copy of the job ClassAd (by default, once every 60s); if they evaluate to true for a given job, they will remove or hold it.  This will almost immediately preempt execution on the worker node.

[Note: While effective and simple, these are not the best way to accomplish this sort of policy!  As the worker node may talk to multiple schedds (via flocking, or just through a complex pool with many schedds), it's best to express the node's preferences locally.]

At HCC, the periodic hold and remove policy looks like this:

# hold jobs using absurd amounts of disk (100+ GB) or memory (1.6+ GB)
SYSTEM_PERIODIC_HOLD = \
   (JobStatus == 1 || JobStatus == 2) && ((DiskUsage > 100000000 || ResidentSetSize > 1600000))

# forceful removal of running jobs after 2 days, held jobs after 6 hours,
# and anything trying to run more than 10 times
SYSTEM_PERIODIC_REMOVE = \
   (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*6) || \
   (JobStatus == 2 && CurrentTime - EnteredCurrentStatus > 3600*24*2) || \
   (JobStatus == 5 && JobRunCount >= 10) || \
   (JobStatus == 5 && HoldReasonCode =?= 14 && HoldReasonSubCode =?= 2)

We place anything on hold that goes over some pre-defined resource limit (disk usage or memory usage).  Jobs are removed if they have been on hold for a long time, have run for too long, have restarted too many times, or are missing their input files.
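To make the logic of the two expressions easier to follow, here is a rough Python restatement. It is only a sketch: the function names are made up for this example, a plain dict stands in for the job ClassAd, and ClassAd handling of undefined attributes is simplified to Python defaults.

```python
# Illustrative restatement of the hold/remove policy; this is not Condor code.
IDLE, RUNNING, HELD = 1, 2, 5  # JobStatus codes used in the expressions above


def should_hold(job):
    """Idle or running jobs that exceed the disk or memory limit get held."""
    active = job.get("JobStatus") in (IDLE, RUNNING)
    too_much_disk = job.get("DiskUsage", 0) > 100000000
    too_much_memory = job.get("ResidentSetSize", 0) > 1600000
    return active and (too_much_disk or too_much_memory)


def should_remove(job, now):
    """Mirror of the four clauses of the remove expression."""
    status = job.get("JobStatus")
    in_state = now - job.get("EnteredCurrentStatus", now)
    return (
        (status == HELD and in_state > 3600 * 6)
        or (status == RUNNING and in_state > 3600 * 24 * 2)
        or (status == HELD and job.get("JobRunCount", 0) >= 10)
        or (status == HELD
            and job.get("HoldReasonCode") == 14
            and job.get("HoldReasonSubCode") == 2)
    )
```

A running job with ResidentSetSize of 1620340 would be held by this check, while the same job already in the held state would not be held again.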

Note that this is a flat policy for the cluster - heterogeneous nodes with large amounts of RAM per core would not be well-utilized.  We could tweak this by having users add the RequestMemory attribute to their job's ad (defaulting to 1.6GB), place into the Requirements that the slot have sufficient memory, and have the node only accept jobs that request memory below a certain threshold.  The expression above could then be tweaked to hold jobs where (ResidentSetSize > RequestMemory).  Perhaps more on that in the future if we go this route.
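As a rough illustration of that per-job variant, here is a Python sketch (again, not Condor configuration). It assumes ResidentSetSize is reported in KB and RequestMemory in MB, so the two need a unit conversion before they can be compared; the 1600 MB default and the function name are assumptions made for this example.

```python
# Sketch of a per-job memory check; units and default are assumptions.
DEFAULT_REQUEST_MEMORY_MB = 1600  # stands in for the 1.6GB default mentioned above


def over_requested_memory(job):
    """True when the job's resident set exceeds what it asked for."""
    request_mb = job.get("RequestMemory", DEFAULT_REQUEST_MEMORY_MB)
    rss_kb = job.get("ResidentSetSize", 0)
    return rss_kb > request_mb * 1024  # convert MB to KB before comparing
```

With this shape, a job that requests 2048 MB is only held once it grows past that request, instead of being cut off at the flat cluster-wide limit.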

While the SYSTEM_PERIODIC_* expressions are useful, Dan Bradley recently introduced me to the SYSTEM_PERIODIC_*_REASON parameters.  These allow you to build a custom hold message for the user whose jobs you're about to interrupt.  The expression is evaluated within the context of the job's ad, and the resulting string is placed in the job's HoldReason attribute.  As an example, previously, the hold message was something bland and generic:

The SYSTEM_PERIODIC_HOLD expression evaluated to true.

Why did it evaluate to true?  Was it memory or disk usage?  When it was held, how bad was the disk/memory usage?  These things can get lost in the system.  Oops.  We added the following to our schedd's configuration:

# Report why the job went on hold.
SYSTEM_PERIODIC_HOLD_REASON = \
   strcat("Job in status ", JobStatus, \
   " put on hold by SYSTEM_PERIODIC_HOLD due to ", \
   ifThenElse(isUndefined(DiskUsage) || DiskUsage < 100000000, \
      strcat("memory usage ", ResidentSetSize), \
      strcat("disk usage ", DiskUsage)), ".")

Now, we have beautiful error messages in the user's logs explaining the issue:

Job in status 2 put on hold by SYSTEM_PERIODIC_HOLD due to memory usage 1620340.
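For readers less familiar with ClassAd functions, the same message construction can be sketched in Python. The function name is invented for this example and a dict stands in for the job ad; only the branching mirrors the configuration above.

```python
# Python rendering of the strcat/ifThenElse expression above (illustrative only).
def hold_reason(job):
    disk = job.get("DiskUsage")
    # isUndefined(DiskUsage) || DiskUsage < 100000000 -> blame memory
    if disk is None or disk < 100000000:
        cause = "memory usage %s" % job.get("ResidentSetSize")
    else:
        cause = "disk usage %s" % disk
    return "Job in status %s put on hold by SYSTEM_PERIODIC_HOLD due to %s." % (
        job.get("JobStatus"), cause)


print(hold_reason({"JobStatus": 2, "ResidentSetSize": 1620340}))
```

Note that the message blames memory whenever disk usage is under its limit, which works because the hold expression only fires when at least one of the two limits is exceeded.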

One less thing to get confused about!

Friday, December 23, 2011

A simple iRODS Micro-Service


The goal I had for this task was to identify and understand the steps and configurations involved in writing a micro-service and seeing it in action; for details regarding iRODS, please refer to the iRODS documentation.  The micro-service that I wrote is very simplistic (it writes a hello world message to the system log); however, it serves its purpose by providing an overview of the steps involved in writing a useful micro-service.

Before I document the configuration and code involved in creating and registering the new micro-service, let’s look at figure 1.

Figure 1 shows a high-level view of the invocation of a micro-service by the iRODS rule engine. One way of looking at the micro-service and the iRODS rule engine is to think of them as an event-based triggering system that can perform ‘operations’ on data objects and/or external resources. The micro-services are registered in iRODS rule definitions, and the rule engine invokes them based on the condition specified for that rule. For a list of places in the iRODS workflow where a micro-service may be triggered please visit:

Also you may refer to for a detailed diagram of a micro-service invocation.

Figure 2 above shows the communication between the iRODS rule engine and a micro-service. A simplistic view of the communication layers is that the rule engine calls a defined C procedure, which exposes its functionality through an interface (commonly prefixed with msi). The arguments to the procedure are passed through a structure named msParam_t that is defined below:

typedef struct MsParam {
  char *label;
  char *type;         /* this is the name of the packing instruction in
                       * rodsPackTable.h */
  void *inOutStruct;
  bytesBuf_t *inpOutBuf;
} msParam_t;

Writing the micro-service

Figure 3 shows the steps involved in creating a new micro-service:

Write the C procedure

The C code below (let's call it test.c) has a function writemessage that writes a message to the system log. There is an interface to the function named msiWritemessage, which exposes the writemessage function. The msi function takes a list of arguments of type msParam_t and a last argument of type ruleExecInfo_t for the result of the operation.

#include <stdio.h>
#include <unistd.h>
#include <syslog.h>
#include <string.h>
#include "apiHeaderAll.h"

void writemessage(char arg1[], char arg2[]);
int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,  ruleExecInfo_t *rei);

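/* Write both strings to the system log. */
void writemessage(char arg1[], char arg2[]) {
    openlog("slog", LOG_PID|LOG_CONS, LOG_USER);
    syslog(LOG_INFO, "%s %s from micro-service", arg1, arg2);
}

/* msi interface: unpack the two string parameters and log them. */
int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,  ruleExecInfo_t *rei) {
    char *in1 = "";   /* defaults guard against parameters of an unexpected type */
    char *in2 = "";
    RE_TEST_MACRO ("    Calling Procedure");
    // the above line is needed for loop back testing using irule -i option
    if ( strcmp( mParg1->type, STR_MS_T ) == 0 )
        in1 = (char*) mParg1->inOutStruct;
    // the rule used later passes two strings, so the second argument is read as a string too
    if ( strcmp( mParg2->type, STR_MS_T ) == 0 )
        in2 = (char*) mParg2->inOutStruct;
    writemessage(in1, in2);
    return rei->status;
}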

Next I will make a folder structure in the modules folder of the iRODS home for placing this micro-service, copy a few files from the example properties module, and modify them to fit the test.c micro-service:

cd ~irods
mkdir modules/HCC
cd modules/HCC

mkdir microservices
mkdir rules
mkdir lib
mkdir clients
mkdir servers

mkdir microservices/src
mkdir microservices/include
mkdir microservices/obj
cp ../properties/Makefile .
cp ../properties/info.txt .

Listed below is my working copy of Makefile and the info.txt

ifndef buildDir
buildDir = $(CURDIR)/../..
endif

include $(buildDir)/config/
include $(buildDir)/config/
include $(buildDir)/config/
include $(buildDir)/config/

# Directories
MSObjDir =    $(modulesDir)/HCC/microservices/obj
MSSrcDir =    $(modulesDir)/HCC/microservices/src
MSIncDir =    $(modulesDir)/HCC/microservices/include

# Source files

OBJECTS =    $(MSObjDir)/test.o

# Compile and link flags

.PHONY: all server client microservices clean
.PHONY: server_ldflags client_ldflags server_cflags client_cflags
.PHONY: print_cflags

# Build everything
all:    microservices

# List module's objects and needed libs for inclusion in clients
client_ldflags:

# List module's includes for inclusion in the clients
client_cflags:

# List module's objects and needed libs for inclusion in the server
server_ldflags:
	@echo $(OBJECTS) $(LIBS)

# List module's includes for inclusion in the server
server_cflags:
	@echo $(INCLUDE_FLAGS)

# Build microservices
microservices:    print_cflags $(OBJECTS)

# Build client additions
client:

# Build server additions
server:

# Build rules

# Clean
clean:
	@echo "Clean image module..."
	rm -rf $(MSObjDir)/*.o

# Show compile flags
print_cflags:
	@echo "Compile flags:"
	@echo "    $(CFLAGS_OPTIONS)"

# Compile targets
$(OBJECTS): $(MSObjDir)/%.o: $(MSSrcDir)/%.c $(DEPEND)
	@echo "Compile image module `basename $@`..."
	@$(CC) -c $(CFLAGS) -o $@ $<


Name:        HCC
Brief:        HCC Test microservice
Description:    HCC Test microservice.
Enabled:    yes
Creator:    Ashu Guru
Created:    December 2011
License:    BSD

In the next step I will define the micro-service header and micro-service table files so that iRODS can be configured with the new micro-service. This is done in the folder microservices/include. In this example there is no header for this code, so I have left the header file blank; in the micro-service table file I have the entry for the table definition. The specifics to note below are that the first argument is the label of the micro-service, the second argument is the count of input arguments of the msi interface (do not count the ruleExecInfo_t argument), and the third argument is the name of the msi interface function.

File microservices/include/microservices.table

{ "msiWritemessage",2,(funcPtr) msiWritemessage },   
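To illustrate what the three fields of a table entry are for, here is a small Python model of a label-to-function table. This is only a sketch of the idea, not how the iRODS rule engine is implemented; the dict-based rei and the invoke helper are inventions for this example.

```python
# Toy model of a micro-service table: label -> (argument count, function).
def msi_writemessage(arg1, arg2, rei):
    """Stand-in for the C msi function; appends to a list instead of syslog."""
    rei["log"].append("%s %s from micro-service" % (arg1, arg2))
    return rei["status"]


MICROSERVICE_TABLE = {
    "msiWritemessage": (2, msi_writemessage),
}


def invoke(label, args, rei):
    """Look up a micro-service by label and check the argument count first."""
    argc, func = MICROSERVICE_TABLE[label]
    if len(args) != argc:
        raise ValueError("%s expects %d arguments, got %d" % (label, argc, len(args)))
    return func(*args, rei)  # rei is always passed last and is not counted


rei = {"status": 0, "log": []}
invoke("msiWritemessage", ["HelloWorld", "String 2"], rei)
```

The count in the middle field lets the caller reject a badly formed rule before the function is ever called, which is why the ruleExecInfo_t argument is left out of it.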

Following is the directory tree structure for the HCC module that I have so far:
bash-4.1$ pwd 
bash-4.1$ tree HCC
├── clients
├── info.txt
├── lib
├── Makefile
├── microservices
│   ├── include
│   │   ├── microservices.header
│   │   ├── microservices.table
│   ├── obj
│   └── src
│       ├── test.c
├── rules
└── servers

Next I will make an entry to enable the new module (this micro-service). This is done in the file ~irods/config/ so that the iRODS Makefile can include the new micro-service in the build. To do this, simply add the module folder name (in my case HCC) to the variable MODULES.

Compile and test

cd ~irods/modules/<YOURMODULENAME>

The above commands should result in the creation of an object file in the microservices/obj folder. I am going to test the micro-service manually first; to accomplish this I will create a client-side rule file in the folder ~irods/clients/icommands/test/rules. I have named the file and following are the contents of the file:

The first line in the file is the rule definition and the second line holds the input parameters. To test the micro-service I will invoke it, and it will then write a message to the system log (see figure below).

Recompile iRODS

Before this step I must make the entries for the headers and the msi table in the iRODS main micro-service action table (i.e. file ~irods/server/re/include/reAction.h). This should be done using the following commands:

rm server/re/include/reAction.h
make reaction 

However, I had to manually add the code segment below to the file server/re/include/reAction.h to accomplish that:

int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,  ruleExecInfo_t *rei);

Finally, recompile iRODS:

cd ~irods
make test_flags
make modules
./irodsctl stop
make clean
./irodsctl start
./irodsctl status

Register Micro-service and Test

In this step we define a rule that will trigger the micro-service when a new data object is uploaded to iRODS. Open the file ~irods/server/config/reConfigs/ and add the following line to the Test Rules section.

acPostProcForPut {msiWritemessage("HelloWorld","String 2"); }

That is it. If I now put (iput) any file into iRODS, a message is added to the /var/log/messages file on the iRODS server. Please note that the above rule does not filter for a particular occurrence; it is a catch-all rule that applies to all put events.


Thursday, December 15, 2011

How to create openstack controller

As before, the "official" instructions on which our procedure is based are here:

First setup the repository:

rpm -ivh openstack-repo-2011.2-1.el6.noarch.rpm

Then install openstack and dependencies

yum install libvirt
chkconfig libvirtd on
/etc/init.d/libvirtd start
yum install euca2ools openstack-nova-{api,compute,network,objectstore,scheduler,volume} openstack-nova-cc-config openstack-glance

Start services:

service mysqld start
chkconfig mysqld on
service rabbitmq-server start
chkconfig rabbitmq-server on

Setup database authorisations. First set up root password:

mysqladmin -uroot password

Now, to automate the procedure, create an executable shell script with the following content (fill in the relevant user name and password fields as well as the IPs):



#!/bin/bash

DB_NAME=nova
DB_USER="" # database user name, fill your own
DB_PASS="" # password, fill your own (this script also uses it as the MySQL root password)

#CC_HOST="A.B.C.D" # IPv4 address
CC_HOST="" # IPv4 address, fill your own
#HOSTS='node1 node2 node3' # compute nodes list
HOSTS='' # compute nodes list, fill your own

mysqladmin -uroot -p$DB_PASS -f drop nova
mysqladmin -uroot -p$DB_PASS create nova

for h in $HOSTS localhost; do
    echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO '$DB_USER'@'$h' IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p$DB_PASS mysql
done
echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO root IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p$DB_PASS mysql
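If you want to preview which statements the grant loop in the script above will send to MySQL before running it, a small Python sketch like the following can print them. The function name and the sample values are placeholders made up for this example; nothing here talks to a database.

```python
# Preview of the GRANT statements generated by the shell script (no DB access).
def grant_statements(db_name, db_user, db_pass, hosts):
    statements = []
    # one grant per compute node, plus localhost, as in the shell loop
    for host in list(hosts) + ["localhost"]:
        statements.append(
            "GRANT ALL PRIVILEGES ON %s.* TO '%s'@'%s' IDENTIFIED BY '%s';"
            % (db_name, db_user, host, db_pass))
    # final grant for root
    statements.append(
        "GRANT ALL PRIVILEGES ON %s.* TO root IDENTIFIED BY '%s';"
        % (db_name, db_pass))
    return statements


for line in grant_statements("nova", "someuser", "secret", ["node1", "node2"]):
    print(line)
```

With two compute nodes this yields four statements: one per node, one for localhost, and one for root.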

And now execute this script:


Create db schema

nova-manage db sync

Now comes a point that is not in the "official" instructions. The installation will not work unless you patch your python:

patch -p0 < rhel6-nova-network-patch.diff

Create logical volumes:

lvcreate -L 1G --name test nova-volumes

For your convenience create an openstack startup shell script

Here is its content:

#!/bin/bash
# Pass the requested action (start, stop, ...) to every OpenStack service.
for n in api compute network objectstore scheduler volume; do
    service openstack-nova-$n "$@"
done
service openstack-glance-api "$@"

And finally we are ready to start openstack: start

With fingers crossed you should get

Starting OpenStack Nova API Server: [ OK ]
Starting OpenStack Nova Compute Worker: [ OK ]
Starting OpenStack Nova Network Controller: [ OK ]
Starting OpenStack Nova Object Storage: [ OK ]
Starting OpenStack Nova Scheduler: [ OK ]
Starting OpenStack Nova Volume Worker: [ OK ]
Starting OpenStack Glance API Server: [ OK ]

Now we need to configure and customize the installation which is another story for another day...


How to create openstack worker node

The "official" instructions on how to install openstack components are located here:

Unfortunately they are not very clear and miss some key points. Below is a summary of our installation procedure.

First of all, let us install the worker node.

rpm -ivh openstack-repo-2011.2-1.el6.noarch.rpm
yum install libvirt
chkconfig libvirtd on
/etc/init.d/libvirtd start
yum install openstack-nova-compute openstack-nova-compute-config
service openstack-nova-compute start

If everything goes fine you should see

Starting OpenStack Nova Compute Worker: [ OK ]

Thursday, December 8, 2011

Network Accounting for Condor

It's been a long time since the August post describing how to set up manual network accounting for a process.  We now have a solution integrated into Condor and available on github.  It takes a bit of explanation to understand how it works, so I've put together a series of diagrams to illustrate it.

First, we start off with the lowly condor_starter on any worker node with a network connection (to simplify things, I didn't draw the other condor processes involved):
By default, all processes on the node are in the same network namespace (labelled the "System Network Namespace" in this diagram).  We denote the network interface with a box, and assume it has address

Next, the starter will create a pair of virtual ethernet devices.  We will refer to them as pipe devices, because any byte written into one will come out of the other - just like a venerable Unix pipe:
By default, the network pipes are in a down state and have no IP address associated with them.  Not very useful!  At this point, we have some decisions to make: how should the network pipe device be presented to the network?  Should it be networked at layer 3, using NAT to route packets?  Or should we bridge it at layer 2, allowing the device to have a public IP address?

Really, it's up to the site, but we assume most sites will want to take the NAT approach: the public IP address might seem useful, but would require a public IP for each job.  To allow customization, all the routing is done by a helper script, but we provide a default implementation for NAT.  The script:
  • Takes two arguments, a unique "job identifier" and the name of the network pipe device.
  • Is responsible for setting up any routing required for the device.
  • Must create an iptables chain with the same name as the "job identifier".
    • Each rule in the chain will record the number of bytes matched; at the end of the job, these will be reported in the job ClassAd using an attribute name identical to the comment on the rule.
  • On stdout, returns the IP address the internal network pipe should use.
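The contract in the list above can be sketched as a small Python helper. This is not the default script shipped with the patch, just an illustration of the interface: the uplink device name em1, the address 192.168.0.2, the comment labels, and the exact set of rules are all assumptions, and the commands are only printed to stderr, never executed.

```python
# Hypothetical helper following the contract above: two arguments in, IP on stdout.
import sys

INNER_IP = "192.168.0.2"  # assumed address for the internal end of the pipe


def setup_commands(job_id, device, uplink="em1", inner_ip=INNER_IP):
    """Return the iptables invocations a NAT-style helper might issue."""
    return [
        ["iptables", "-N", job_id],  # per-job chain named after the job identifier
        ["iptables", "-A", "FORWARD", "-i", device, "-j", job_id],
        ["iptables", "-A", "FORWARD", "-o", device, "-j", job_id],
        # the comment on each rule becomes the accounting attribute name
        ["iptables", "-A", job_id, "-i", device, "-o", uplink,
         "-m", "comment", "--comment", "Outgoing", "-j", "ACCEPT"],
        ["iptables", "-A", job_id, "-i", uplink, "-o", device,
         "-m", "state", "--state", "RELATED,ESTABLISHED",
         "-m", "comment", "--comment", "Incoming", "-j", "ACCEPT"],
        ["iptables", "-A", job_id, "-j", "REJECT"],
        ["iptables", "-t", "nat", "-A", "POSTROUTING", "-s", inner_ip,
         "-o", uplink, "-j", "MASQUERADE"],
    ]


def main(argv):
    job_id, device = argv[1], argv[2]
    for command in setup_commands(job_id, device):
        print("would run: " + " ".join(command), file=sys.stderr)
    print(INNER_IP)  # stdout carries only the IP for the internal device


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Keeping stdout reserved for the IP address and sending everything else to stderr is what lets the caller parse the result without guessing.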
Additionally, Condor provides a cleanup script that does the inverse of the setup script.  The result looks something like this:
Next, the starter forks a separate process in a new network namespace using the clone() call with the CLONE_NEWNET flag.  Notice that, by default, no network devices are accessible in the new namespace:
Next, the external starter will pass one side of the pipe to the other namespace; the internal starter will do some minimal configuration of the device (default route, IP address, set the device to the "up" status):
Finally, the starter exec's to the job.  Whenever the job does any network operations, the bytes are routed via the internal network pipe, come out the external network pipe, and then are NAT'd to the physical network device before exiting the machine.

As mentioned, the whole point of the exercise is to do network accounting.  Since all packets go through one device, Condor can read out all the activity via iptables.  The "helper script" above will create a unique chain per job.  This allows some level of flexibility; for example, the chain below allows us to distinguish between on-campus and off-campus packets:

Chain JOB_12345 (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 ACCEPT     all  --  veth0  em1     anywhere           /* OutgoingInternal */
    0     0 ACCEPT     all  --  veth0  em1     anywhere            !        /* OutgoingExternal */
    0     0 ACCEPT     all  --  em1    veth0        anywhere             state RELATED,ESTABLISHED /* IncomingInternal */
    0     0 ACCEPT     all  --  em1    veth0  !        anywhere             state RELATED,ESTABLISHED /* IncomingExternal */
    0     0 REJECT     all  --  any    any     anywhere             anywhere             reject-with icmp-port-unreachable

Thus, the resulting ClassAd history from this job will have an attribute for NetworkOutgoingInternal, NetworkOutgoingExternal, NetworkIncomingInternal, and NetworkIncomingExternal.  We have an updated Condor Gratia probe that looks for Network* attributes and reports them appropriately to the accounting database.
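To show how rule comments turn into attribute names, here is a Python sketch that parses a chain listing. The actual patch reads the counters from C, so this is a stand-in rather than its method. The sample text, its addresses, and its counters are made up, and the parser assumes exact byte counts (as printed by iptables -L -v -n -x).

```python
# Sketch: turn per-rule byte counters into Network* attributes (sample data is invented).
import re

COMMENT = re.compile(r"/\*\s*(\w+)\s*\*/")

SAMPLE = """\
Chain JOB_12345 (2 references)
    pkts    bytes target  prot opt in     out    source       destination
      12     3400 ACCEPT  all  --  veth0  em1    0.0.0.0/0    10.0.0.0/8    /* OutgoingInternal */
       5      900 ACCEPT  all  --  veth0  em1    0.0.0.0/0    !10.0.0.0/8   /* OutgoingExternal */
      10    52000 ACCEPT  all  --  em1    veth0  10.0.0.0/8   0.0.0.0/0     state RELATED,ESTABLISHED /* IncomingInternal */
       4     1200 ACCEPT  all  --  em1    veth0  !10.0.0.0/8  0.0.0.0/0     state RELATED,ESTABLISHED /* IncomingExternal */
       0        0 REJECT  all  --  *      *      0.0.0.0/0    0.0.0.0/0     reject-with icmp-port-unreachable
"""


def network_attributes(listing):
    """Map each commented rule to a Network<Comment> byte total."""
    attributes = {}
    for line in listing.splitlines():
        match = COMMENT.search(line)
        if not match:
            continue  # headers and uncommented rules carry no attribute name
        byte_count = int(line.split()[1])  # second column is the byte counter
        name = "Network" + match.group(1)
        attributes[name] = attributes.get(name, 0) + byte_count
    return attributes


print(network_attributes(SAMPLE))
```

Rules without a comment, such as the final REJECT, are simply skipped, so only traffic the site chose to label ends up in the job's history.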

Thus, we have byte-level network accounting, allowing us to answer the age-old question of "how much would a CMS T2 cost on Amazon EC2?".  Or perhaps we could answer "how much is a currently running job going to cost me?"  Matt has pointed out the network setup callout could be used to implement security zones, isolating (or QoS'ing) jobs of certain users at the network level.  There are quite a few possibilities!

We'll definitely be returning to this work mid-2012 when the local T2 is based on SL6, and this patch can be put into production.  There will be some further engagement with the Condor team to see if they're interested in taking the patch.  The Gratia probe work to manage network information will be interesting upstream too.  Finally, I encourage interested readers to take a look at the github branch.  The patch itself is a tour-de-force of several dark corners of Linux systems programming (involves using clone, synchronization between processes with pipes, sending messages to the kernel via netlink to configure the routing, and reading out iptables configurations using C).  It was very rewarding to implement!

Thursday, December 1, 2011

Details on glexec improvements

My last blog post gave a quick overview of why glexec exists, what issues folks run into, and what we did to improve it.  Let's go into some details.

How Condor Update Works
The lcmaps-plugins-condor-update package contains the modules necessary to advertise the payload certificate of the last glexec invocation in the pilot's ClassAd.  The concept is simple - the implementation is a bit tricky.

Condor has long had a command-line tool called condor_advertise; it allows an admin to hand-advertise updates to ads in the collector.  Unfortunately, that's not quite what we need here: we want to update the job ad in the schedd, while condor_advertise typically updates the machine ad in the collector.  Close, but no cigar.

There's a lesser-known utility called condor_chirp that we can use.  Typically, condor_chirp is used to do I/O between the schedd and the starter (for example, you can pull/push files on demand in the middle of the job), but it can also update the job's ad in the schedd.  The syntax is simple:

condor_chirp set_job_attr ATTR_NAME ATTR_VAL

(look at the clever things Matt does with condor_chirp).  As condor_chirp allows additional access to the schedd, the user must explicitly request it in the job ad.  If you want to try it out, you must add the following line into your submit file:

+WantIOProxy=TRUE

To work, chirp must know how to contact the starter and have access to the "magic cookie"; these are located inside the $_CONDOR_SCRATCH_DIR, as set by Condor in the initial batch process.  As the glexec plugin runs as root (glexec must be setuid root to launch a process as a different UID), we must guard against being fooled by the invoking user.
Accordingly, the plugin uses /proc to read the parentage of the process tree until it finds a process owned by root.  If this is not init, it is assumed the process is the condor_starter, and the job's $_CONDOR_SCRATCH_DIR can be deduced from the $CWD and the PID of the starter.  Since we only rely on information from root-owned processes, we can be fairly sure this is the correct scratch directory.  As a further safeguard, before invoking condor_chirp, the plugin drops privilege to that of the invoking user.  Along with the other security guarantees provided by glexec, we have confidence that we are reading the correct chirp configuration and are not allowing the invoker to increase its privileges.
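The ancestry walk can be sketched in Python as follows. The real plugin is written in C and its details may differ; the function names here are invented, and the lookup parameter exists only so the walk can be tried against a fake process table.

```python
# Sketch of walking up the process tree until a root-owned process is found.
import os


def read_proc(pid):
    """Return (parent pid, owner uid) for a pid, read from /proc."""
    with open("/proc/%d/stat" % pid) as handle:
        stat = handle.read()
    # the command name sits in parentheses and may contain spaces,
    # so split only what follows the last closing parenthesis
    fields = stat[stat.rindex(")") + 2:].split()
    ppid = int(fields[1])  # fields[0] is the state, fields[1] the parent pid
    uid = os.stat("/proc/%d" % pid).st_uid
    return ppid, uid


def find_root_ancestor(pid, lookup=read_proc):
    """Climb from pid; return the first root-owned pid, or None if only init is left."""
    while pid > 1:
        ppid, uid = lookup(pid)
        if uid == 0:
            return pid  # assumed to be the condor_starter
        pid = ppid
    return None


# fake table: pid -> (parent pid, uid); 100 plays the role of the starter
fake = {300: (200, 1000), 200: (100, 1000), 100: (1, 0)}
print(find_root_ancestor(300, lookup=fake.__getitem__))
```

Returning None when the walk reaches init matches the rule in the text: a root-owned ancestor only counts as the starter if it is not init.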

Once we know how to invoke condor_chirp, the rest of the process is all downhill.  glexec internally knows the payload's DN, the payload Unix user, and does the equivalent of the following:

condor_chirp set_job_attr glexec_user "hcc"
condor_chirp set_job_attr glexec_x509userproxysubject "/DC=org/DC=cilogon/C=US/O=University of Nebraska-Lincoln/CN=Brian Bockelman A621"
condor_chirp set_job_attr glexec_time 1322761868

condor_chirp writes the data into the starter, which then updates the shadow, then the schedd (some of the gory details are covered in the Condor wiki).
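As an illustration of the three updates above, this Python sketch builds the equivalent argument lists. It only constructs the commands and does not run condor_chirp; the function names are made up, and the quoting helper assumes ClassAd string values are written in double quotes with backslash escapes.

```python
# Build (but do not run) the condor_chirp invocations shown above.
def classad_string(value):
    """Quote a Python string as a ClassAd string literal (assumed escaping rules)."""
    return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')


def chirp_commands(unix_user, subject_dn, timestamp):
    return [
        ["condor_chirp", "set_job_attr", "glexec_user", classad_string(unix_user)],
        ["condor_chirp", "set_job_attr", "glexec_x509userproxysubject",
         classad_string(subject_dn)],
        # the time is an integer, so it is passed unquoted
        ["condor_chirp", "set_job_attr", "glexec_time", str(int(timestamp))],
    ]


for command in chirp_commands("hcc", "/DC=org/DC=example/CN=Some User", 1322761868):
    print(" ".join(command))
```

Passing each command as an argument list rather than a single shell string avoids a second layer of shell quoting around the DN, which contains spaces.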

The diagram below illustrates the data flow:

Putting this into Play
If you really want to get messy, you can check out the source code from Subversion at:
(web view)

The current version of the plugin is 0.0.2.  It's available in Koji, or via yum in the osg-development repository:

yum install --enablerepo=osg-development lcmaps-plugins-condor-update

(you must already have the osg-release RPM installed and glexec otherwise configured).

After installing it, you need to update the /etc/lcmaps.db configuration file on the worker node to invoke the condor-update module.  In the top half, I add:

condor_updates = "lcmaps_condor_update.mod"

Then, I add condor-update to the glexec policy:


verifyproxy -> gumsclient
gumsclient -> condor_updates
condor_updates -> tracking

Note we use the "tracking" module locally; most sites will use the "glexec-tracking" module.  Pick the appropriate one.

Finally, you need to turn on the I/O proxy in the Condor submit file.  We do this by editing the Condor job manager module (for RPMs, located in /usr/lib/perl5/vendor_perl/5.8.8/Globus/GRAM/JobManager/).  We add the following line into the submit routine, right before queue is added to the script file:

print SCRIPT_FILE "+WantIOProxy=TRUE\n";

All new incoming jobs will get this attribute; any glexec invocations they do will be reflected at the CE!

GUMS and Worker Node Certificates
To map a certificate to a Unix user, glexec calls out to the GUMS server using XACML with a grid-interoperable profile.  In the XACML callout, GUMS is given the payload's DN and VOMS attributes.  The same library (LCMAPS/SCAS-client) and protocol can also make callouts directly to SCAS, more commonly used in Europe.

GUMS is a powerful and flexible authorization tool; one feature is that it allows different mappings based on the originating hostname.  For example, if desired, my certificate could map to user hcc at one host but map to cmsprod at another.  To prevent "just anyone" from probing the GUMS server, GUMS requires the client to present an X509 certificate (in this case, the hostcert); it takes the hostname from the client's certificate.

This has the unfortunate side-effect of requiring a host certificate on every node that invokes GUMS; OK for the CE (100 in the OSG), but not for glexec on the worker nodes (thousands on the OSG).

When glexec is invoked in EGI, SCAS is invoked using the pilot certificate for HTTPS and information about the payload certificate in the XACML callout; this requires no worker node host certificate.

To replicate how glexec works in EGI, we had to develop a small patch to GUMS.  When the pilot certificate is used for authentication, the pilot's DN is recorded to the logs (so we know who is invoking GUMS), but the host name is self-reported in the XACML callout.  As the authentication is still performed, we believe this relaxing of the security model is acceptable.

A patched, working version of GUMS can be found in Koji and is available in the osg-development repository.  It will still be a few months before the RPM-based GUMS install is fully documented and released, however.

Once installed, two changes need to be made at the server:

  • Do all hostname mappings based on "DN" in the web interface, not the "CN".
  • Any group of users (for example, /cms/Role=pilot) that want to invoke GUMS must have "read all" access, not just "read self".
Further, /etc/lcmaps.db needs to be changed to remove the following lines from the gumsclient module:

"-cert   /etc/grid-security/hostcert.pem"
"-key    /etc/grid-security/hostkey.pem"
"--cert-owner root"

This will all be automated going forward, but it should already help remove some of the pain in deploying glexec!