Wednesday, March 19, 2014

Submitting jobs to HTCondor using Python

I've had several requests for a tutorial on using the HTCondor python bindings; current documentation resources for these include:


However, more examples are always useful!  This blog entry will attempt to cover the most common use cases - ClassAds, querying HTCondor, and submitting jobs.

Why Python Bindings?

Before we launch into the how, let's examine the why.  The python bindings provide a developer-friendly mechanism for interacting with HTCondor.  A few highlights:

  • They call the HTCondor libraries directly, avoiding a fork/exec of a subprocess.
  • They provide a "pythonic" interaction with HTCondor; the design is meant to be familiar to a python programmer.  Errors raise python exceptions.
  • They have thorough integration with ClassAds.  Because they use the HTCondor implementation of ClassAds, the result is a very complete implementation of the ClassAd language.  ClassAd expressions can be created cleanly without worrying about string quoting issues.
  • Most actions that can be performed through the HTCondor command-line tools are exposed via python.


The bindings themselves are compiled against the system version of python and a specific version of HTCondor.  This limits their portability (you cannot reliably email compiled binaries to others), meaning they are most effective when they are installed onto the system by the sysadmin; that said, they ship with HTCondor on every platform supported by UW except Windows.

Loading the modules

The bindings are split into two python modules, htcondor and classad.  To verify your environment is set up correctly, do the following in python:
$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import classad
>>> import htcondor
>>> 
If no exception is thrown, you are ready to proceed to the next section!  If an exception is thrown, check your HTCondor installation and the value of the PYTHONPATH environment variable if you are using a non-root install.
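
The same check can be scripted, which is handy if you want to test several machines.  This is only a minimal sketch: the helper name check_bindings is made up for this example, and the only thing it assumes is the two module names above.

```python
import importlib


def check_bindings(names=("classad", "htcondor")):
    """Try to import each module; map its name to True (loaded) or False."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


# Report one line per module; a missing module usually means PYTHONPATH
# does not point at the HTCondor python directory.
for name, ok in sorted(check_bindings().items()):
    print("%s: %s" % (name, "ok" if ok else "missing (check PYTHONPATH)"))
```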

Begin with the Basics: ClassAds

ClassAds are the lingua franca of HTCondor, and hence the basic essential data structure of the python bindings.  Each ad is formed as a set of key-value pairs, where the value is a ClassAd expression (such as 2 + 2).  This differs from a JSON map, where the value must be a literal (4).  When evaluating expressions, one can reference other attributes in the ClassAd.

Consider the following ClassAd interaction:
>>> from classad import ClassAd, ExprTree
>>> ad = ClassAd()
>>> ad['foo'] = 1
>>> ad['bar'] = 2
>>> ad['baz'] = ExprTree("foo + bar")
>>> ad
[ baz = foo + bar; bar = 2; foo = 1 ]
>>> ad['baz']
foo + bar
>>> ad['baz'].eval()
3L
>>> 
We first create an empty ClassAd, then do some value assignments in a manner similar to a Python dictionary.  For the baz attribute, we create a new ExprTree (a ClassAd expression) object.  The string given to the ExprTree constructor is parsed as a ClassAd expression, not as python.

Note that if we reference baz, the expression itself is returned; if we instead referenced foo, the python object 1 would be returned.  The classad library will coerce references to python objects if possible; if not possible, it will return ExprTrees.  To force the return of an ExprTree, use the lookup method of the ClassAd; to force the return of a python object, use the eval method.
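
To make the three access paths concrete, here is a small sketch that collects what each one returns for a given attribute.  The helper names describe and demo are hypothetical; the classad import is deferred to demo() so that the helper itself can be read and exercised without the bindings installed.

```python
def describe(ad, name):
    """Return (ad[name], ad.lookup(name), ad.eval(name)) for one attribute."""
    return ad[name], ad.lookup(name), ad.eval(name)


def demo():
    """Rebuild the ad from the session above; needs the classad module."""
    import classad

    ad = classad.ClassAd()
    ad["foo"] = 1
    ad["bar"] = 2
    ad["baz"] = classad.ExprTree("foo + bar")
    for name in ("foo", "baz"):
        direct, looked_up, evaluated = describe(ad, name)
        # For foo, direct access is coerced to a python object.
        # For baz, direct access stays an ExprTree; only eval gives a value.
        print("%s: %r / %r / %r" % (name, direct, looked_up, evaluated))
```

Call demo() from a session where import classad succeeds.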

In 8.1.3, HTCondor introduced more convenient ways to build expressions.  We could replace ExprTree("foo + bar") above with:
Attribute("foo") + Attribute("bar")
We believe that explicitly forming expressions in this manner is less likely to result in quoting issues (analogous to how one avoids SQL injection attacks).

ClassAd expressions include the most common programming operators, lists, sub-ClassAds, attribute references, function calls, strings, numbers, and booleans.  See the full language description for a thorough treatment.

Querying HTCondor

The two most common daemons to query in HTCondor are the collector (which holds descriptions for all daemons running in the pool) and the schedd (which maintains the job queue).

We'll start with the collector.  Begin by creating a Collector object:
>>> coll = htcondor.Collector()
The collector object will default to the collector daemon in the machine's configuration; alternately, the constructor accepts a hostname as a string argument.

Once created, you can use the query method to get ClassAds from the collector:
>>> ads = coll.query(htcondor.AdTypes.Startd)
>>> len(ads)
4128
This returns a Python list of ClassAds.  By default all attributes for all ClassAds of a given type are returned by query; returning such a large amount of data can take a long time.  Further function arguments refine the amount of data returned:
>>> ads = coll.query(htcondor.AdTypes.Startd, 'Machine =?= "red-d9n1.unl.edu"', ["Name", "RemoteOwner"]) 
>>> len(ads)
15
>>> ads[0]
[ Name = "slot1@red-d9n1.unl.edu"; MyType = "Machine"; TargetType = "Job"; CurrentTime = time() ]
The second argument provides a ClassAd expression which serves as a filter; the third argument is a list of attributes to include.  Note that the collector may add some default attributes and may not return a requested attribute if it is not present in the ad.
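
Because a requested attribute can be missing from a returned ad, code that walks the result should not assume every key exists.  Here is a sketch of one way to handle that, counting slots per RemoteOwner.  The function names and the "(none)" label are my own inventions; the query call simply mirrors the one above, and the hostname is assumed to contain no quote characters.

```python
from collections import Counter


def slots_per_owner(ads):
    """Count ads per RemoteOwner; ads lacking the attribute go under '(none)'."""
    counts = Counter()
    for ad in ads:
        try:
            owner = ad["RemoteOwner"]
        except KeyError:
            owner = "(none)"
        counts[owner] += 1
    return counts


def query_machine(hostname):
    """Fetch Name and RemoteOwner for every slot ad of one machine."""
    import htcondor  # deferred so slots_per_owner works without the bindings

    coll = htcondor.Collector()
    constraint = 'Machine =?= "%s"' % hostname  # assumes no quotes in hostname
    return coll.query(htcondor.AdTypes.Startd, constraint, ["Name", "RemoteOwner"])
```

With the bindings installed, slots_per_owner(query_machine("red-d9n1.unl.edu")) would tally the 15 ads from the example query.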

The creation of a Schedd object can be done in a manner similar to the Collector for a local schedd:

>>> schedd = htcondor.Schedd()
Alternately, you can use the Collector's locate method to find a remote Schedd address:
>>> addr = coll.locate(htcondor.DaemonTypes.Schedd, "schedd.example.com")
>>> schedd = htcondor.Schedd(addr)

Once the schedd object is created, the query method is used to list jobs:
>>> jobs = schedd.query()
>>> len(jobs)
2096
Again, additional arguments allow you to trim the number of ads and the number of attributes returned:
>>> jobs = schedd.query('Owner=?="cmsprod088"', ["ClusterId", "JobStatus"])
>>> len(jobs)
336
>>> jobs[0]
[ MyType = "Job"; JobStatus = 2; TargetType = "Machine"; ServerTime = 1395254896; CurrentTime = time(); ClusterId = 2940860 ]
The xquery method was added in 8.1.5.  Instead of buffering all ads in memory in the form of a python list, xquery returns an iterator; reading through the iterator will block as ClassAds are returned by the schedd.  This reduces total memory usage and allows the user to interleave several queries at once.
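
As a rough illustration of why the iterator form helps, the sketch below tallies jobs by status while keeping only the counts in memory.  The function names are hypothetical, the owner string is assumed to contain no quote characters, and I pass the filter and attribute list to xquery positionally, in the same order used for query above.

```python
# Standard JobStatus codes; anything else is reported as "Other".
JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}


def count_by_status(job_ads):
    """Tally JobStatus over any iterable of ads, holding only the counts."""
    counts = {}
    for ad in job_ads:
        label = JOB_STATUS.get(ad["JobStatus"], "Other")
        counts[label] = counts.get(label, 0) + 1
    return counts


def status_for_owner(owner):
    """Stream one owner's jobs from the local schedd (needs 8.1.5 or later)."""
    import htcondor  # deferred so count_by_status works without the bindings

    schedd = htcondor.Schedd()
    constraint = 'Owner =?= "%s"' % owner  # assumes no quotes in owner
    return count_by_status(schedd.xquery(constraint, ["JobStatus"]))
```

Since count_by_status accepts any iterable, the same function also works on the list returned by query.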

Submitting Jobs

Submitting jobs is one of the more confusing aspects of the Python bindings for beginners.  This is because job descriptions must be provided as a ClassAd instead of in the HTCondor submit file format.  The submit file format is a macro substitution language evaluated at submit time, whereas a ClassAd holds typed values and expressions.

For example, consider the following submit file:
executable = test.sh
arguments = foo bar
log = test.log
output = test.out.$(Process)
error = test.err
transfer_output_files = output
should_transfer_files = yes
queue 1
The equivalent submit ClassAd is:
[
    Cmd = "test.sh";
    Arguments = "foo bar";
    UserLog = "test.log";
    Out = strcat("test.out.",ProcId);
    Err = "test.err";
    TransferOutput = "output";
    ShouldTransferFiles = "YES";
]

A few items of note for converting submit files to ClassAds:
  • The translation from the submit file commands to ClassAd attributes often results in different attribute names (executable corresponds to Cmd).  An extensive, but not exhaustive, list of attributes is available in the HTCondor manual.
  • Some submit file commands result in multiple attribute changes in the ClassAd.  If you are unsure how a submit file command maps to a ClassAd, you can run condor_submit -dump /dev/null test.submit to have HTCondor dump the resulting ClassAd to stdout.  This command includes all attributes, including ones that are auto-filled; do not copy the entire ad, but look just for the changes.
  • Submit file commands do not have a type, and the quoting rules differ between commands; you must properly quote strings in the ClassAd using the ClassAd language rules.
  • Macro substitution is not available in ClassAds.  Notice how test.out.$(Process) in the submit file is strcat("test.out.",ProcId) in the ClassAd; the latter is evaluated at runtime.
Once you have your ClassAd prepared, submitting it is straightforward:
>>> schedd = htcondor.Schedd()
>>> schedd.submit(ad)
23498
The return value is the Cluster ID.  To submit multiple jobs in the same job cluster, you can pass a second argument to submit.  For example, to submit 5 jobs:
>>> schedd.submit(ad, 5)
23499
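
Putting the pieces together, here is one way the ad could be built from python rather than typed out as text.  Treat it as a sketch: job_attributes and submit_job are names I made up, and the attribute values are the ones from the example submit file.  The Out expression keeps the trailing dot after test.out so the file names match test.out.$(Process).

```python
def job_attributes(executable, arguments, log):
    """Plain-python view of the job ad; 'Out' holds ClassAd expression text."""
    return {
        "Cmd": executable,
        "Arguments": arguments,
        "UserLog": log,
        "Out": 'strcat("test.out.", ProcId)',
        "Err": "test.err",
        "TransferOutput": "output",
        "ShouldTransferFiles": "YES",
    }


def submit_job(attributes, count=1):
    """Copy the attributes into a ClassAd and submit it; returns the cluster ID."""
    import classad   # deferred so job_attributes works without the bindings
    import htcondor

    ad = classad.ClassAd()
    for name, value in attributes.items():
        if name == "Out":
            # Wrap in ExprTree so it stays an expression, not a string literal.
            value = classad.ExprTree(value)
        ad[name] = value
    return htcondor.Schedd().submit(ad, count)
```

For example, submit_job(job_attributes("test.sh", "foo bar", "test.log"), 5) would queue five jobs in one cluster, just like schedd.submit(ad, 5).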

Parting Thoughts

In this entry, we covered the basics of using the HTCondor python bindings.  We covered only about 10% of the API; left untouched were advanced ClassAd topics, manipulating jobs, remote submission, and managing running daemons.

I hope to have a few more entries to cover other aspects of the API.  Have a particular request?  Leave a comment!

1 comment:

  1. I would love to see how to do remote submission with the Python API. In particular, how file transfer is handled between the client and server.
