OSG Technology Area Rumblings

Submitting jobs to HTCondor using Python

2014-03-19T13:32:00.000-07:00

I've had several requests for a tutorial on using the HTCondor python bindings; current documentation resources for these include:

My presentation at the 2013 HTCondor week.
The HTCondor manual page.
Python's built-in help() facility.
The HTCondor users mail list.

However, more examples are always useful! This blog entry will attempt to cover the most common use cases - ClassAds, querying HTCondor, and submitting jobs.

Why Python Bindings?

Before we launch into the how, let's examine the why. The python bindings provide a developer-friendly mechanism for interacting with HTCondor. A few highlights:

They call the HTCondor libraries directly, avoiding a fork/exec of a subprocess.
They provide a "pythonic" interaction with HTCondor; the design is meant to be familiar to a python programmer. Errors raise python exceptions.
They have thorough integration with ClassAds. Because they use the HTCondor implementation of ClassAds, the result is a very complete implementation of the ClassAd language. ClassAd expressions can be created cleanly without worrying about string quoting issues.
Most actions that can be performed through the HTCondor command-line tools are exposed via python.

The bindings themselves are compiled against the system version of python and a specific version of HTCondor. This limits the portability (you cannot reliably email compiled binaries to others), meaning they are most effective when they are installed onto the system by the sysadmin; that said, they are shipped with all HTCondor versions supported by UW except for Windows.

Loading the modules

The bindings are split into two python modules, htcondor and classad. To verify your environment is set up correctly, do the following in python:

$ python

Python 2.7.5 (default, Aug 25 2013, 00:04:04)

[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin

Type "help", "copyright", "credits" or "license" for more information.

>>> import classad

>>> import htcondor

>>>

If no exception is thrown, you are ready to proceed to the next section! If an exception is thrown, check your HTCondor installation and the value of the PYTHONPATH environment variable if you are using a non-root install.

Begin with the Basics: ClassAds

ClassAds are the lingua franca of HTCondor, and hence the basic essential data structure of the python bindings. Each ad is formed as a set of key-value pairs, where the value is a ClassAd expression (such as 2 + 2). This differs from a JSON map, where the value must be a literal (4). When evaluating expressions, one can reference other attributes in the ClassAd.

Consider the following ClassAd interaction:

>>> from classad import ClassAd, ExprTree

>>> ad = ClassAd()

>>> ad['foo'] = 1

>>> ad['bar'] = 2

>>> ad['baz'] = ExprTree("foo + bar")

>>> ad

[ baz = foo + bar; bar = 2; foo = 1 ]

>>> ad['baz']

foo + bar

>>> ad['baz'].eval()

3L

>>>

We first create an empty ClassAd, then do some value assignments in a manner similar to a Python dictionary. For the baz attribute, we create a new ExprTree (a ClassAd expression) object. The string given to the ExprTree constructor is parsed as a new ClassAd expression.

Note that if we reference baz, the expression itself is returned; if we instead referenced foo, the python object 1 would be returned. The classad library will coerce references to python objects if possible; if not possible, it will return ExprTrees. To force the return of an ExprTree, use the lookup method of the ClassAd; to force the return of a python object, use the eval method.

In 8.1.3, HTCondor introduced more convenient ways to build expressions. We could replace ExprTree("foo + bar") above with:

Attribute("foo") + Attribute("bar")

We believe that explicitly forming expressions in this manner is less likely to result in quoting issues (analogous to how one avoids SQL injection attacks).
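To make the quoting concern concrete, here is a minimal sketch in plain Python; it does not use the bindings. The helper name quote_classad_string is hypothetical, and it assumes that escaping backslashes and double quotes is enough for the strings involved. It shows how a value pasted straight into an expression string can break out of its quotes:

```python
def quote_classad_string(value):
    """Return value as a double-quoted ClassAd string literal.

    Hypothetical helper: assumes only backslashes and double quotes
    need escaping for the values being handled.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


owner = 'bob" || true || "'  # hostile or merely unlucky input

# Naive string pasting: the embedded quote ends the literal early,
# so the rest of the value becomes part of the expression itself.
naive = 'Owner =?= "%s"' % owner

# Escaping first keeps the whole value inside one string literal.
safe = "Owner =?= %s" % quote_classad_string(owner)

print(naive)
print(safe)
```

Building the expression from objects, as with Attribute above, sidesteps the problem entirely because no string is ever re-parsed.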

ClassAd expressions include the most common programming operators, lists, sub-ClassAds, attribute references, function calls, strings, numbers, and booleans. See the full language description for a thorough treatment.

Querying HTCondor

The two most common daemons to query in HTCondor are the collector (which holds descriptions for all daemons running in the pool) and the schedd (which maintains the job queue).

We'll start with the collector. Begin by creating a Collector object:

>>> coll = htcondor.Collector()

The collector object will default to the collector daemon in the machine's configuration; alternately, the constructor accepts a hostname as a string argument.

Once created, you can use the query method to get ClassAds from the collector:

>>> ads = coll.query(htcondor.AdTypes.Startd)

>>> len(ads)

4128

This returns a Python list of ClassAds. By default all attributes for all ClassAds of a given type are returned by query; returning such a large amount of data can take a long time. Further function arguments refine the amount of data returned:

>>> ads = coll.query(htcondor.AdTypes.Startd, 'Machine =?= "red-d9n1.unl.edu"', ["Name", "RemoteOwner"])

>>> len(ads)

15

>>> ads[0]

[ Name = "slot1@red-d9n1.unl.edu"; MyType = "Machine"; TargetType = "Job"; CurrentTime = time() ]

The second argument provides a ClassAd expression which serves as a filter; the third argument is a list of attributes to include. Note that the collector may add some default attributes and may not return a requested attribute if it is not present in the ad.

The creation of a Schedd object can be done in a manner similar to the Collector for a local schedd:

>>> schedd = htcondor.Schedd()

Alternately, you can use the Collector's locate method to find a remote Schedd address:

>>> addr = coll.locate(htcondor.DaemonTypes.Schedd, "schedd.example.com")
>>> schedd = htcondor.Schedd(addr)

Once the schedd object is created, the query method is used to list jobs:

>>> jobs = schedd.query()
>>> len(jobs)
2096

Again, additional arguments allow you to trim the number of ads and the number of attributes returned:

>>> jobs = schedd.query('Owner=?="cmsprod088"', ["ClusterId", "JobStatus"])

>>> len(jobs)

336

>>> jobs[0]

[ MyType = "Job"; JobStatus = 2; TargetType = "Machine"; ServerTime = 1395254896; CurrentTime = time(); ClusterId = 2940860 ]

Starting in 8.1.5, the xquery method has been added. Instead of buffering all ads in memory in the form of a python list, xquery returns an iterator; reading through the iterator will block as ClassAds are returned by the schedd. This reduces total memory usage and allows the user to interleave several queries at once.
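The difference between a list and an iterator can be sketched without a running schedd. In the plain-Python example below, fake_query and interleave are hypothetical stand-ins (they are not part of the bindings); each generator plays the role of one streaming query, and ads from the two are consumed in alternation without either result set being fully buffered:

```python
def fake_query(name, count):
    """Stand-in for a streaming query: yields one ad (a dict) at a time."""
    for job_id in range(count):
        yield {"Schedd": name, "ClusterId": job_id}


def interleave(*iterators):
    """Round-robin over several iterators until all are exhausted."""
    pending = list(iterators)
    while pending:
        for it in list(pending):
            try:
                yield next(it)
            except StopIteration:
                pending.remove(it)


ads = list(interleave(fake_query("a", 2), fake_query("b", 3)))
print([(ad["Schedd"], ad["ClusterId"]) for ad in ads])
```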

Submitting Jobs

Submitting jobs is one of the more confusing aspects of the Python bindings for beginners. This is because job descriptions must be provided as a ClassAd instead of HTCondor submit file format. The submit file format is a macro substitution language evaluated at submit time.

For example, consider the following submit file:

executable = test.sh
arguments = foo bar
log = test.log
output = test.out.$(Process)
error = test.err
transfer_output_files = output
should_transfer_files = yes
queue 1

The equivalent submit ClassAd is:

[
  Cmd = "test.sh";
  Arguments = "foo bar";
  UserLog = "test.log";
  Out = strcat("test.out.", ProcId);
  Err = "test.err";
  TransferOutput = "output";
  ShouldTransferFiles = "YES";
]

A few items of note for converting submit files to ClassAds:

The translation from the submit file commands to ClassAd attributes often results in different attribute names (executable corresponds to Cmd). An extensive, but not exhaustive, list of attributes is available in the HTCondor manual.
Some submit file commands result in multiple attribute changes in the ClassAd. If you are unsure how a submit file command maps to a ClassAd, you can run condor_submit -dump /dev/null test.submit to have HTCondor dump the resulting ClassAd to stdout. This command includes all attributes, including ones that are auto-filled; do not copy the entire ad, but look just for the changes.
Submit file commands do not have a type and the quoting rules differ between commands; you must properly quote strings in the ClassAd using the ClassAd language rules.
Macro substitution is not available in ClassAds. Notice how test.out.$(Process) in the submit file becomes strcat("test.out.", ProcId) in the ClassAd; the latter is evaluated at runtime.
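The conversion rules above can be sketched as a small translator. This is plain Python, not part of the bindings; the function names are hypothetical, the attribute table contains only the mappings shown in the example, and the only macro handled is $(Process). Treat it as an illustration of the renaming, quoting, and macro points rather than a replacement for condor_submit:

```python
import re

# Submit command -> ClassAd attribute, taken only from the example above.
ATTRIBUTE_NAMES = {
    "executable": "Cmd",
    "arguments": "Arguments",
    "log": "UserLog",
    "output": "Out",
    "error": "Err",
}


def to_classad_value(text):
    """Turn submit-file text into a ClassAd expression string."""
    parts = re.split(r"\$\(Process\)", text)
    quoted = ['"%s"' % p.replace("\\", "\\\\").replace('"', '\\"') for p in parts]
    if len(parts) == 1:
        return quoted[0]
    # Macro substitution is unavailable, so reference ProcId at runtime.
    pieces = []
    for i, q in enumerate(quoted):
        if q != '""':
            pieces.append(q)
        if i < len(quoted) - 1:
            pieces.append("ProcId")
    return "strcat(%s)" % ", ".join(pieces)


def submit_to_classad(commands):
    """Render a dict of submit commands as ClassAd text."""
    lines = ["  %s = %s;" % (ATTRIBUTE_NAMES[k], to_classad_value(v))
             for k, v in commands.items()]
    return "[\n" + "\n".join(lines) + "\n]"


print(submit_to_classad({
    "executable": "test.sh",
    "arguments": "foo bar",
    "output": "test.out.$(Process)",
}))
```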

Once you have your ClassAd prepared, submitting it is straightforward:

>>> schedd = htcondor.Schedd()
>>> schedd.submit(ad)
23498

The return value is the Cluster ID. To submit multiple jobs in the same job cluster, you can pass a second argument to submit. For example, to submit 5 jobs:

>>> schedd.submit(ad, 5)
23499

Parting Thoughts

In this entry, we covered the basics of using the HTCondor python bindings. We covered only about 10% of the API; left untouched were advanced ClassAd topics, manipulating jobs, remote submission, and managing running daemons.

I hope to have a few more entries to cover other aspects of the API. Have a particular request? Leave a comment!

Introducing the HTCondor-CE

2013-01-28T07:33:00.001-08:00

At the heart of the OSG Compute Element (CE) is the gatekeeper software. The gatekeeper software anchors three core pieces of functionality:

Remote access: The gatekeeper provides a network service that remote clients can contact and interact with.
Authentication and authorization: The gatekeeper is responsible for authenticating the client and deciding on what actions it is authorized to perform.
Resource allocation: The gatekeeper accepts an abstract description of a resource to allocate and actualizes the resource request within the local environment.

The existing software, Globus GRAM, provides an HTTP-like interface over TLS for remote access. The authentication is done using the Grid Security Infrastructure (GSI), using special client certificates. It does authorization by performing a callout to map the client certificate to a Unix account, then performing all further operations as that Unix user. The resource allocation provides an interface which accepts requests in Globus RSL (a job description language) and interacts with a local batch system on the CE to run the job.

With the HTCondor team, the OSG has been working to provide an alternate gatekeeper implementation, the HTCondor-CE. The HTCondor-CE is a special configuration of the HTCondor software which provides the three core pieces of functionality described above.

HTCondor provides remote access using a custom communication protocol called CEDAR. CEDAR provides an RPC and messaging mechanism over UDP or TCP, and can provide various levels of integrity or encryption based upon the session parameters. While the HTCondor-CE will ship with the same GSI authentication and authorization as Globus GRAM, it can be reconfigured to provide alternate authentication mechanisms such as Kerberos, SSL, shared secret, or even IP-based authentication.

The HTCondor-CE allocates resources by having the client submit HTCondor jobs to a scheduler running on the CE (the schedd daemon). We refer to this as the "grid job". A separate daemon, the JobRouter, is responsible for transforming the grid job to a resource allocation for the site. For a site with an HTCondor batch system, it will transform and mirror the grid job into the routed job in the site's batch system. The process is illustrated below:

The submit workflow for the HTCondor-CE running on a site with the HTCondor batch system. Notice the JobRouter copies the job directly into the site's batch system.

For sites with the PBS batch system, the routed job stays in the HTCondor-CE schedd (as the JobRouter does not know how to submit directly into the PBS queue), and the job is submitted into PBS using the blahp daemon. See the illustration below:

The HTCondor-CE submit workflow for a PBS site. Notice the blahp, not the JobRouter, does the submission to PBS in this case.

The blahp daemon is a common piece of software for interacting with batch systems - in addition to being integrated in the HTCondor grid universe, it also is used by the BOSCO project and the CREAM CE.

Note there is no requirement that the job be routed into a batch system - given the appropriate transform logic, the JobRouter could also transform the grid job into a VM running in Amazon EC2, an OpenStack instance, or a job for another HTCondor-CE!

The CE is quite flexible; it is a configuration of the HTCondor software and leverages all the features available in HTCondor. As another example, we benefit from the fact that HTCondor's security uses sessions; clients do not re-authenticate for each status update. Future features, such as the sandbox size limits in the upcoming 7.9.4, can be used immediately by the CE through a configuration file change.

The HTCondor-CE is currently under development, although functionality has been demonstrated using glideinWMS for up to 5,000 running pilots. It requires HTCondor 7.9.2 or later, so we are waiting for the next stable release (due late April) before starting to release the CE more widely. As we near release, I am planning on doing additional updates on specific pieces of this technology.

We're looking forward to seeing how users will put it into action!

Fun with ClassAds

2013-01-05T14:58:00.000-08:00

One of the new technologies the OSG Technology area is working on is the HTCondor-CE. While that is a topic for a different post, it led me on a surprising journey over my Christmas break.

Working with the HTCondor-CE, I found creating a job hook to be surprisingly difficult. A job hook for HTCondor is an external script, invoked by HTCondor in lieu of running internal logic. This allows a sysadmin to add custom logic to HTCondor internals without resorting to writing C++ code. The hook in question is the job transformation step for the JobRouter.

The problem with hooks is they are surprisingly difficult to write. For the transform hook, a job's ClassAd is written to the script's stdin and the JobRouter expects to read the transformed ClassAd from stdout. [Actually, it's a touch more complicated than that, but this simplification will do for our discussion.] ClassAds are an expressive and powerful language - but a language difficult to parse via Unix scripting! There are complex quoting and attribute evaluation rules.

Sysadmins are left with a decision - either spend quite some time implementing a ClassAd parser or only do the bare minimum and hope no one submits a complex ClassAd. I found the situation unsatisfactory and decided to write python bindings for the ClassAd library.

I found the endeavor fairly straightforward using the Boost.Python library, and ended up with a new GitHub project. Now, a job transform hook is as simple as this:

#!/usr/bin/python

import sys
import classad

route_ad = classad.ClassAd(sys.stdin.readline())
separator_line = sys.stdin.readline()
assert separator_line == "------\n"
ad = classad.parseOld(sys.stdin)

ad["Universe"] = 5
ad["GridResource"] = "condor localhost localhost"
if "x509UserProxyFirstFQAN" in ad and "/cms" in ad.eval("x509UserProxyFirstFQAN"):
    ad["AccountingGroup"] = "cms.%s" % ad.eval("Owner")
else:
    ad["AccountingGroup"] = "other.%s" % ad.eval("Owner")

print ad.printOld(),

The above script will read the ad from stdin and change the AccountingGroup attribute based on the contents of the x509UserProxyFirstFQAN attribute.
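For trying out a hook without a running JobRouter, the framing the script relies on (one line of route ad, a separator line, then the job ad) can be exercised with plain Python. The sketch below handles only that framing and leaves the ClassAd parsing to the library; split_hook_input is a hypothetical helper and the sample input is invented for illustration:

```python
import io

SEPARATOR = "------\n"  # separator line asserted by the hook script above


def split_hook_input(stream):
    """Split transform-hook stdin into (route_ad_text, job_ad_text)."""
    route_ad_text = stream.readline()
    if stream.readline() != SEPARATOR:
        raise ValueError("expected separator line after the route ad")
    return route_ad_text, stream.read()


# Invented sample input; a real JobRouter supplies much larger ads.
sample = io.StringIO(
    '[ Name = "example_route" ]\n'
    "------\n"
    'Owner = "alice"\n'
    "Universe = 5\n"
)
route_text, job_text = split_hook_input(sample)
print(route_text.strip())
print(job_text.splitlines())
```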

Note ClassAds can be constructed from a string or a file object. Each ad can be treated like a python dictionary. Literals are converted to the equivalent python objects; expressions are exposed as objects. For example:

>>> import classad
>>> ad = classad.ClassAd()
>>> expr = classad.ExprTree("2+2")
>>> ad["foo"] = expr
>>> print ad["foo"]
2 + 2
>>> print ad["foo"].eval()
4

Most of the functionality is exposed; see the GitHub project for examples and unit tests. To make the C++ library safe to export to python, some minor semantics have been changed. Sub-ClassAds and Lists are not yet available via python, but shouldn't be too hard to add.

ClassAd Python bindings - maybe not the most life-changing software project in the world. However, they have potential to become one of life's little pleasures for those of us who deal with HTCondor every day!

Resource Isolation in Condor using cgroups

2012-03-10T11:28:00.000-08:00

This is the last in my series on job isolation techniques. It has spanned several postings over the last month, so it may help to recap:

Part I covered process isolation, preventing processes in one job from interacting with other jobs. This has been achievable through POSIX mechanisms for a while, but the new PID namespaces mechanisms provide improved isolation for jobs running as the same user.
Part II and Part III discussed file isolation using bind mounts and chroots. Condor uses bind mounts to remove access to "problematic" directories such as /tmp. While more complex to set up, chroots allow jobs to run in a completely separate environment from the host and further isolate the job sandbox.

This post will cover resource isolation: preventing jobs from consuming system resources promised to another job.

Condor has always had some crude form of resource isolation. For example, the worker node could be configured to detect when the processes in a job have more CPU time than walltime (a rough indication that more than one core is being used) or when the sum of each process's virtual memory size exceeds the memory requested for the job. When Condor detects too many resources are being consumed, it can take an action such as suspending or killing the job.
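The two traditional checks can be written down in a few lines. This is a plain-Python sketch, not Condor's implementation; the function names and the shape of the polling sample are assumptions made for illustration:

```python
def over_cpu(cpu_seconds, wall_seconds):
    """More CPU time than walltime suggests more than one core in use."""
    return cpu_seconds > wall_seconds


def over_memory(process_vsizes_mb, requested_mb):
    """Sum of per-process virtual memory sizes against the request."""
    return sum(process_vsizes_mb) > requested_mb


def poll_job(sample, requested_mb):
    """Return the list of limits violated in one polling sample."""
    violations = []
    if over_cpu(sample["cpu_seconds"], sample["wall_seconds"]):
        violations.append("cpu")
    if over_memory(sample["vsizes_mb"], requested_mb):
        violations.append("memory")
    return violations


sample = {"cpu_seconds": 130.0, "wall_seconds": 60.0, "vsizes_mb": [900, 1500]}
print(poll_job(sample, requested_mb=2048))
```

Note that poll_job only sees whatever the sample contains; anything the job does between two samples is invisible to it.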

This traditional approach is relatively unsatisfactory for a few reasons:

Condor periodically polls to view resource consumption. Any activity between polls is unmonitored.
The metrics Condor traditionally monitors are limited to memory and CPU, where the memory metrics are poor quality for complex jobs. The sum of many processes' virtual memory sizes, on a modern Linux box, has little correlation with RAM used and is not particularly meaningful.
We can do little with the system besides detect when resource limits have been violated and kill the job.

We cannot, for example, simply instruct the kernel to reduce the job's memory or CPU usage.
Accordingly, users must ask for peak resource usage, which may be well above average resource usage, decreasing overall throughput. If the job needs 2GB on average but 4GB for a single second, the user will ask for 4GB; the other 2GB will be unutilized.

In Linux, the oldest form of resource isolation is processor affinity or CPU pinning: a job can be locked to a specific CPU, and all its processes will inherit the affinity. Because two jobs are locked to separate CPUs, they will never consume each other's CPU resources. CPU pinning is unsatisfactory for reasons similar to memory: jobs can't utilize otherwise-idle CPUs, decreasing potential system throughput. The granularity is also poor: you can't evenly fairshare 25 jobs on a machine with 24 cores as each job must be locked to at least one core. However, it's a step forward - you don't need to kill jobs for using too much CPU - and present in Condor since 7.3.

Newer Linux kernels support cgroups, which are structures for managing groups of processes, and provide controllers for managing resources in each cgroup. In Condor 7.7.0, cgroup support was added for measuring resource usage. When enabled, Condor will place each job into a dedicated cgroup for the block-I/O, memory, CPU, and "freezer" controllers. We have implemented two new limiting mechanisms based on the memory and CPU controllers.

The CPU controller provides a mechanism for fairsharing between different cgroups. CPU shares are assigned to jobs based on the "slot weight" (by default, equal to the number of cores the job requested). Thus, a job asking for 2 cores will get an average of 2 cores on a fully loaded system. If there's an idle CPU, it could utilize more than 2 cores; however, it will never get less than what it requested for a significant amount of time. CPU fairsharing provides a much finer granularity than pinning, easily allowing the jobs-to-cores ratio to be non-integer.
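The arithmetic behind fairsharing is simple proportional division. The sketch below is a simplified model in plain Python, assuming every job is always runnable and that each job's share equals the cores it requested; expected_cores is a hypothetical helper, not anything the kernel or Condor exposes:

```python
def expected_cores(requested_cores, total_cores):
    """Cores each job gets on a fully loaded host under CPU fair sharing.

    Shares are assumed equal to the cores each job requested (the default
    slot weight described above).
    """
    total_shares = sum(requested_cores)
    return [total_cores * share / total_shares for share in requested_cores]


# 25 single-core jobs on 24 cores: each gets 24/25 of a core.
print(expected_cores([1] * 25, 24)[0])

# One 2-core job plus 22 single-core jobs exactly fill 24 cores.
print(expected_cores([2] + [1] * 22, 24)[0])
```

The first case is exactly the one pinning cannot express: 25 jobs sharing 24 cores evenly.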

The memory controller provides two kinds of limits: soft and hard. When soft limits are in place, the job can use an arbitrary amount of RAM until the host runs out of memory (and starts to swap); when this happens, only jobs over their limit are swapped out. With hard limits, the job immediately starts swapping once it hits its RAM limit, regardless of the amount of free memory. Both soft and hard limits default to the amount of memory requested for the job.
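The difference between the two kinds of limit can be modeled as a decision about which jobs get swapped. This is a deliberately simplified plain-Python model of the behavior described above, not of the kernel's actual reclaim logic; jobs_swapped and its inputs are hypothetical:

```python
def jobs_swapped(jobs, host_ram_mb, mode):
    """Names of jobs that would be pushed to swap.

    jobs maps a job name to (used_mb, limit_mb); mode is "soft" or "hard".
    """
    over_limit = [name for name, (used, limit) in jobs.items() if used > limit]
    if mode == "hard":
        return over_limit  # swapping starts at the limit, free RAM or not
    if mode == "soft":
        total_used = sum(used for used, _ in jobs.values())
        # Limits only matter once the host itself is out of memory.
        return over_limit if total_used > host_ram_mb else []
    raise ValueError("mode must be 'soft' or 'hard'")


jobs = {"a": (3000, 2048), "b": (1000, 2048)}
print(jobs_swapped(jobs, host_ram_mb=8192, mode="soft"))
print(jobs_swapped(jobs, host_ram_mb=8192, mode="hard"))
print(jobs_swapped(jobs, host_ram_mb=3500, mode="soft"))
```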

Both methods also have disadvantages. Soft limits can cause "well-behaved" processes to wait while the OS frees up RAM from "badly behaving" processes. Hard limits can cause large amounts of swapping (for example, if there's a memory leak), decreasing the entire node's disk performance and thus adversely affecting other jobs. In fact, it may be a better use of resources to preempt a heavily-swapping process and reschedule it on another node than let it continue running. There is further room for improvement here in the future.

Regardless, cgroups and controllers provide a solid improvement in resource isolation for Condor, and finish up our series on job isolation. Thanks for reading!

Improving File Isolation with chroot

2012-02-27T11:33:00.000-08:00

In the last post, we examined a new Condor feature called MOUNT_UNDER_SCRATCH that will isolate jobs from each other on the file system by making world-writable directories (such as /tmp and /var/tmp) be unique and isolated per-batch-job.

That work started with the assumption that jobs from the same Unix user don't need to be isolated from each other. This isn't necessarily true on the grid: a single, shared account per-VO is still popular on the OSG. For such VOs, an attacker can gain additional credentials by reading the sandbox of each job running under the same Unix username.

To combat proxy-stealing, we use an old Linux trick called a "chroot". A sysadmin can create a complete copy of the OS inside a directory, and an appropriately-privileged process can change the root of its filesystem ("/") to that directory. In fact, the phrase "changing root" is where we get the "chroot" terminology.

For example, suppose the root of the system looks like this:

[root@localhost ~]# ls /
bin     cvmfs         hadoop-data2  home        media  opt   selinux  usr
boot    dev           hadoop-data3  lib         misc   proc  srv      var
cgroup  etc           hadoop-data4  lib64       mnt    root  sys
chroot  hadoop-data1  hadoop.log    lost+found  net    sbin  tmp

The sysadmin can create a copy of the RHEL5 operating system inside a sub-directory; at our site, this is /chroot/sl5-v3/root:

[root@localhost ~]# ls /chroot/sl5-v3/root/
bin   cvmfs  etc   lib    media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt    proc  sbin  srv      tmp  var

Note how the contents of the chroot directory are stripped down relative to the OS - we can remove dangerous binaries, sensitive configurations, or anything else unnecessary to running a job. For example, many common Linux privilege escalation exploits come from the presence of a setuid binary. Such binaries (at, cron, ping) are necessary for managing the host, but not necessary for a running job. By eliminating the setuid binaries from the chroot, a sysadmin can eliminate a common attack vector for processes running inside.

Once the directory is built, we can call chroot and isolate ourselves from the host:

[root@red-d15n6 ~]# chroot /chroot/sl5-v3/root/
bash-3.2# ls /
bin   cvmfs  etc   lib   media  opt   root  selinux  sys  usr
boot  dev    home  lib64  mnt  proc  sbin  srv      tmp  var

Condor, as of 7.7.5, now knows how to invoke the chroot syscall for user jobs. However, as the job sandbox is written outside the chroot, we must somehow transport it inside before starting the job. Bind mounts - discussed last time - come to our rescue. The entire process goes something like this:

Condor, as root, forks off a new child process.
The child uses the unshare system call to place itself in a new filesystem namespace.
The child calls mount to bind-mount the job sandbox inside the chroot. Any other bind mounts - such as /tmp or /var/tmp - are done at this time.
The child will invoke the chroot system call specifying the directory the sysadmin has configured.
The child drops privileges to the target batch system user, then calls exec to start the user process.
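The five steps above can be sketched in Python. Condor itself does this in C++, so this is only an illustration of the sequence: it must run as root on Linux, the function names are hypothetical, mounting the sandbox at the same path inside the chroot is an assumption, and the --make-rprivate call is my addition for hosts where / is a shared mount.

```python
import ctypes
import os
import subprocess

CLONE_NEWNS = 0x00020000  # from <sched.h>: new mount (filesystem) namespace


def path_inside_chroot(chroot_dir, host_path):
    """Host-side location of host_path once mirrored under chroot_dir."""
    return os.path.join(chroot_dir, host_path.lstrip("/"))


def launch_in_chroot(sandbox, chroot_dir, uid, gid, argv):
    """Fork a child that runs argv inside chroot_dir; requires root on Linux."""
    pid = os.fork()                                # step 1: fork a child
    if pid > 0:
        return os.waitpid(pid, 0)[1]               # parent: wait for the job
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        if libc.unshare(CLONE_NEWNS) != 0:         # step 2: private namespace
            raise OSError(ctypes.get_errno(), "unshare failed")
        # Assumption: keep our mounts from propagating back to the host.
        subprocess.check_call(["mount", "--make-rprivate", "/"])
        target = path_inside_chroot(chroot_dir, sandbox)
        os.makedirs(target, exist_ok=True)
        subprocess.check_call(["mount", "--bind", sandbox, target])  # step 3
        os.chroot(chroot_dir)                      # step 4: change root
        os.chdir(sandbox)                          # same path, now inside
        os.setgroups([])                           # step 5: drop privileges
        os.setgid(gid)
        os.setuid(uid)
        os.execv(argv[0], argv)                    # then start the user job
    finally:
        os._exit(127)                              # reached only on failure
```

Because only this one sandbox is bind-mounted into the private namespace, the job has no path by which to reach any other job's sandbox.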

With this patch applied, Condor will copy only the job's sandbox forward into the filesystem namespace, meaning the job has access to no other sandbox (as all other sandboxes live outside the private namespace). This successfully isolates jobs from each other's sandboxes, even if they run under the same Unix user!

The Condor feature is referred to as NAMED_CHROOT, as sysadmins can create multiple chroot-capable directories, give them a user-friendly name (such as RHEL5, as opposed to /chroot/sl5-v3/root), and allow user jobs to ask for the directory by the friendly name in their submit file.

In addition to the security benefits, we have found the NAMED_CHROOT feature allows us to run a RHEL5 job on a RHEL6 host without using virtualization; something for the future.

Going back to our original list of directories needing isolation - system temporary directories, job sandbox, shared filesystems, and GRAM directories - we have now isolated everything except the shared filesystems. The option here is simple, if unpleasant: mount the shared file system as read-only. This is the modus operandi for $OSG_APP at many sites, and an acceptable (but not recommended) way to run $OSG_DATA (as $OSG_DATA is optional anyway). It restricts the functionality for the user, but brings us a step closer to our goal of job isolation.

After file isolation, we have one thing left: resource isolation. Again, a topic for the future.

File Isolation using bind mounts and chroots

2012-02-20T08:03:00.000-08:00

The last post ended with a new technique for process-level isolation that unlocks our ability to safely use anonymous accounts and group accounts.

However, that's not "safe enough" for us: the jobs can still interact with each other via the file system. This post examines the directories jobs can write into, and what can be done to remove this access.

On a typical batch system node, a user can write into the following directories:

System temporary directories: The Linux Filesystem Hierarchy Standard (FHS) provides at least two sticky, world-writable directories, /tmp and /var/tmp. These directories are traditionally unmanaged (user processes can write an uncontrolled amount of data here) and a security issue (symlink attacks and information leaks), even when user separation is in place.
Job Sandbox: This is a directory created by the batch system as a scratch location for the job. The contents of the directory will be cleaned out by the batch system after the job ends. For Condor, any user proxy, executable, or job stage-in files will be copied here prior to the job starting.
Shared Filesystems: For a non-grid site, this is typically at least $HOME, and some other site-specific directory. $HOME is owned by the user running the job. On the OSG, we also have $OSG_APP for application installation (typically read-only for worker nodes) and, optionally, $OSG_DATA for data staging (writable for worker nodes). If they exist and are writable, $OSG_APP/DATA are owned by root and marked as sticky.
GRAM directories: For non-Condor OSG sites, a few user-writable directories are needed to transfer the executable, proxy, and job stage-in files from the gatekeeper to the worker node. These default to $HOME, but can be relocated to any shared filesystem directory. For Condor-based OSG sites, this is a part of the job sandbox.

If user separation is in place and considered sufficient, filesystem isolation is taken care of for shared filesystems, GRAM directories, and the job sandbox. The systemwide temporary directories can be protected by mixing filesystem namespaces and bind mounts.

A process can be launched in its own filesystem namespace; such a process will have a copy of the system mount table. Any change made to the process's mount table will not be seen by the outside system, and will be shared with any child processes.

For example, if the user's home directory is not mounted on the host, the batch system could create a process in a new filesystem namespace and mount the home directory in that namespace. The home directory will be available to the batch job, but to no other process on the system.

When the last process in the filesystem namespace exits, all mounts that are unique to that namespace will be unmounted. In our example, when the batch job exits, the kernel will unmount the home directory.

A bind mount makes a file or directory visible at another place in the filesystem - I think of it as mirroring the directory elsewhere. We can take the job sandbox directory, create a sub-directory, and bind-mount the sub-directory over /tmp. The process is mostly equivalent to the following shell commands (where $_CONDOR_SCRATCH_DIR is the location of the Condor job sandbox) in a filesystem namespace:

mkdir $_CONDOR_SCRATCH_DIR/tmp
mount --bind $_CONDOR_SCRATCH_DIR/tmp /tmp

Afterward, any files a process creates in /tmp will actually be stored in $_CONDOR_SCRATCH_DIR/tmp - and cleaned up accordingly by Condor on job exit. Any system process not in the job will not be able to see or otherwise interfere with the contents of the job's /tmp unless it can write into $_CONDOR_SCRATCH_DIR.

Condor refers to this feature as MOUNT_UNDER_SCRATCH, and it will be a part of the 7.7.5 release. This will be an admin-specified list of directories on the worker node. With it, the job will have a private copy of these directories, which will be backed by $_CONDOR_SCRATCH_DIR. The contents - and size - of these will be managed by Condor, just like anything else in the scratch directory.
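Extending the two shell commands shown earlier to a whole list of directories looks roughly like this. It is a plain-Python sketch that only prints the commands; the function name and the sandbox path are hypothetical, and nesting /var/tmp under the sandbox as var/tmp is my assumption rather than a description of Condor's actual layout:

```python
import os
import shlex


def scratch_bind_commands(scratch_dir, directories):
    """Shell commands that back each directory with the job sandbox."""
    commands = []
    for directory in directories:
        backing = os.path.join(scratch_dir, directory.lstrip("/"))
        commands.append("mkdir -p %s" % shlex.quote(backing))
        commands.append("mount --bind %s %s"
                        % (shlex.quote(backing), shlex.quote(directory)))
    return commands


for command in scratch_bind_commands("/var/lib/condor/execute/dir_1234",
                                     ["/tmp", "/var/tmp"]):
    print(command)
```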

If user separation is unavailable or not considered sufficient (if there are, for example, group accounts), an additional layer of isolation is needed to protect the job sandbox. A topic for a future day!

Job Isolation in Condor

2012-02-14T09:46:00.000-08:00

I'd like to share a few exciting new features under construction for Condor 7.7.6 (or 7.9.0, as it may be).

I've been working hard to improve the job isolation techniques available in Condor. My dictionary defines the verb "to isolate" as "to be or remain alone or apart from others"; when applied to the Condor context, we'd like to isolate each job from the others. We'll define process isolation as the inability of a process running in a batch job to interfere with a process not a part of the job. Interfering with processes on Linux, loosely defined, means the sending of POSIX signals, taking control via the ptrace mechanism, or writing into the other process's memory.

Process isolation is only one aspect of job isolation. Job isolation also includes the inability to interfere with other jobs' files (file isolation) and not being able to consume others' system resources such as CPU, memory, or disk (resource isolation).

In Condor, process isolation has historically been accomplished via one of two mechanisms:

Submitting user. Jobs from Alice and Bob will be submitted as the unix users alice and bob, respectively. In this model, the jobs running on the worker node will be run as users alice and bob, respectively. The processes in the job running under user bob are protected from the processes in the job running as user alice via traditional POSIX security mechanisms.

This model makes the assumption that jobs submitted by the same user do not need isolation from each other. In other words, there shouldn't be any shared user accounts!
This model also assumes the submit host and the worker node share a common user namespace. This can be more difficult to accomplish than it sounds: if the submit host has thousands of unique users, we must make sure each functions on the worker node. If the submit host is on a remote site with a different user namespace from the worker node, this may not be easily achievable!

Per-slot users. Each "slot" (roughly corresponding to a CPU) in condor is assigned a unique unix user. The job currently running in that slot is run under the associated username.

This solves the "gotchas" noted above with the submitting user isolation model.
This is difficult to accomplish in practice if the job wants to utilize a filesystem shared between the submit and worker nodes. The filesystem security is based on two users having distinct Unix user names; in this model, there's no way to mark your files as only readable by your own jobs.

Notice both techniques rely on user isolation to accomplish process isolation. Condor has an oft-overlooked third mode:

Mapping remote users to nobody. In this mode, local users (where the site admin can define the meaning of "local") get mapped to the submit host usernames, but non-local users all get mapped to user nobody - the traditional unprivileged user on Linux.

Local users can access all their files, but remote users only get access to the batch resources - no shared file systems.

Unfortunately, this is not a very secure mode as, according to the manual, the nobody account "... may also be used by other Condor jobs running on the same machine, if it is a multi-processor machine"; not very handy advice in an age where your cell phone likely is a multi-processor machine!

This third mode is particularly attractive to us - we can avoid filesystem issues for our local users, but no longer have to create the thousands of accounts in our LDAP database for remote users. However, since jobs from remote users run under the same unix user account, the traditional security mechanism of user separation does not apply - we need a new technique!

Enter PID namespaces, a new separation technique introduced in kernel 2.6.24. By passing an additional flag when creating a new process, the kernel will assign an additional process ID (PID) to the child process. The child will believe itself to be PID 1 (that is, when the child calls getpid(), it returns 1), while the processes in the parent's namespace will see a different PID. The child will be able to spawn additional processes - all will be stuck in the same inner namespace - that similarly have an inner PID different from the outer one.

Processes within the namespace can only see and interfere (send signals, ptrace, etc) with other processes inside the namespace. By launching the new job in its own PID namespace, Condor can achieve process isolation without user isolation: the job processes are isolated from all other processes on the system.

Perhaps the best way to visualize the impact of PID namespaces in the job is to examine the output of ps:

[bbockelm@localhost condor]$ condor_run ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bbockelm     1  0.0  0.0 114132  1236 ?        SNs  11:42   0:00 /bin/bash /home/bbockelm/.condor_run.3672
bbockelm     2  0.0  0.0 115660  1080 ?        RN   11:42   0:00 ps faux

Only two processes can be seen from within the job - the shell executing the job script and "ps" itself.
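From outside the job, one way to check whether two processes share a PID namespace is to compare the namespace links under /proc. The sketch below is not part of Condor: it assumes a Linux kernel that exposes /proc/<pid>/ns/pid, and the function name same_pid_namespace is made up for illustration.

```python
import os

def same_pid_namespace(pid_a, pid_b):
    """Return True/False if both PIDs can be inspected, None otherwise.

    Assumes Linux exposes /proc/<pid>/ns/pid. Each link reads like
    'pid:[4026531836]'; equal targets mean the same namespace.
    """
    try:
        ns_a = os.readlink("/proc/%d/ns/pid" % pid_a)
        ns_b = os.readlink("/proc/%d/ns/pid" % pid_b)
    except OSError:
        # Not Linux, kernel too old, or no permission to inspect the process.
        return None
    return ns_a == ns_b
```

Called from the host with the starter's PID and the PID of a job process, a result of False would indicate the job really is in its own namespace.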

Releasing a PID namespaces-enabled Condor is an ongoing effort: https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1959; I've recently re-designed the patch to be far less intrusive on the Condor internals by switching from the glibc clone() call to the clone syscall. I am hopeful it will make it in the 7.7.6 / 7.9.0 timescale.

From a process isolation point-of-view, with this patch, it now is safe to run jobs as user "nobody" or re-introduce the idea of shared "group accounts". For example, we could map all CMS users to a single "cmsuser" account without having to worry about these becoming a vector for virus infection.

However, the story of job isolation does not end with PID namespaces. Stay tuned to find out how we are tackling file and resource isolation!

openstack - update

2012-01-27T10:29:00.000-08:00

Last time I was able to deploy an image. The next step is to list it and then run it, but I have hit problems.

To list images I run the command:

euca-describe-images

which hangs for a long time and eventually exits with the message "connection reset by peer".

I have disabled iptables to eliminate firewall issues. No help.

All the manuals assume that euca-describe-images simply runs, and give no instructions on what to do if it does not.

Following Josh's advice I did:

strace -o edi_output -f -ff euca-describe-images

and then I looked into the output files. It seems that there might be two problems:

Some euca2ools files are missing - in particular the .eucarc configuration file.
There are messages about missing python files, for example "open("/usr/lib64/python2.6/site-packages/gtk-2.0/org.so", O_RDONLY) = -1 ENOENT (No such file or directory)" (there are many more like that).

So it seems that the euca2ools installation described in previous posts may be incomplete and may have missed some key files. Or python (which we already know had to be patched) is not OK. Or both.
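To sift through the strace output files, a small script can pull out the paths that open() failed to find. This is only a sketch: it assumes the line format shown in the message above, and the name missing_paths is made up. Keep in mind that Python probes several candidate locations for each import, so a number of ENOENT lines can be perfectly normal.

```python
import re
from collections import Counter

# Matches failed open() calls as strace prints them, e.g.
# open("/some/path", O_RDONLY) = -1 ENOENT (No such file or directory)
ENOENT_OPEN = re.compile(r'open\("([^"]+)",[^)]*\)\s*=\s*-1 ENOENT')

def missing_paths(strace_text):
    """Count the paths that open() could not find in strace output."""
    return Counter(ENOENT_OPEN.findall(strace_text))
```

Feeding it the contents of each edi_output file and sorting by count gives a quick view of which files are looked for most often.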

That's all I know for now.

How to register an image in openstack

2012-01-10T08:47:00.000-08:00

After having installed and configured the worker and controller nodes of the openstack testbed we would like to upload images into it.

First I downloaded some images to /root/images on the controller node. One is from Xin and another one is a minimal image for testing that I got from the net. I have no idea how good they are.

Then I tried to follow the instructions

http://docs.openstack.org/cactus/openstack-compute/admin/content/part-ii-getting-virtual-machines.html

which go like this:

image="ubuntu1010-UEC-localuser-image.tar.gz"
wget http://c0179148.cdn1.cloudfiles.rackspacecloud.com/ubuntu1010-UEC-localuser-image.tar.gz
uec-publish-tarball $image [bucket-name] [hardware-arch]

and I could not find where the

uec-publish-tarball

command comes from. Finally I realized that it comes from Ubuntu; the manual had become Ubuntu-specific without saying so explicitly.

So I tried a different approach.

cd /root/images

glance add name="My Image" < sl61-kvm.tar.bz2 # the image I got from Xin

The command responded that the image got Id=1, which is a good sign.

Then I did:

glance show 1

and got:

URI: http://0.0.0.0/images/1
Id: 1
Public: No
Name: My Image
Size: 199737477
Location: file:///var/lib/glance/images/1
Disk format: raw
Container format: ovf

Which suggests that the file is in the system. But when I tried:

glance index

it said:

no public images found

So I tried to register it again:

glance add name="My Image" is_public=true < sl61-kvm.tar.bz2
Added new image with ID: 2

I tried to list:

glance index
Found 1 public images...
ID Name Disk Format Container Format Size
---------------- ------------------------------ -------------------- -------------------- --------------
2 My Image raw ovf 199737477

So it seems we have uploaded an image to the system.

Now I have to figure out how to run it.

How to configure worker node - part 2

2012-01-06T07:17:00.000-08:00

Compute node configuration - continued

We execute the following commands:

This command is supposed to synchronize the database:

/usr/bin/nova-manage db sync

Now we have to create users and projects. We call both the user and the project "nova":

/usr/bin/nova-manage user admin nova
/usr/bin/nova-manage project create nova nova 
/usr/bin/nova-manage network create 192.168.0.0/24 1 256

We check that users and projects were created correctly:

/usr/bin/nova-manage project list
nova

/usr/bin/nova-manage user list
nova

Create Certifications

On the controller node execute

mkdir -p /root/creds

/usr/bin/python /usr/bin/nova-manage project zipfile nova nova /root/creds/novacreds.zip

If you encounter a python error, apply the python patch described a few posts earlier.

Create /root/creds on the compute node and copy the novacreds.zip file there. Then unpack it:

unzip /root/creds/novacreds.zip -d /root/creds/

A few files will appear, among them /root/creds/novarc. This file needs to be appended to .bashrc, but there is a catch: the first line of the file has an error and has to be replaced.


Original line:

NOVA_KEY_DIR=$(pushd $(dirname $BASH_SOURCE)>/dev/null; pwd; popd>/dev/null)

has to be replaced with

NOVA_KEY_DIR=~/creds

The content of novarc file now is

NOVA_KEY_DIR=~/creds

export EC2_ACCESS_KEY="XXXXXXXXXXXXXXXXXXXXXXXX:nova"
export EC2_SECRET_KEY="XXXXXXXXXXXXXXXXXXXXXXXX"
export EC2_URL="http://130.199.148.53:8773/services/Cloud"
export S3_URL="http://130.199.148.53:3333"
export EC2_USER_ID=42 # nova does not use user id, but bundling requires it
export EC2_PRIVATE_KEY=${NOVA_KEY_DIR}/pk.pem
export EC2_CERT=${NOVA_KEY_DIR}/cert.pem
export NOVA_CERT=${NOVA_KEY_DIR}/cacert.pem
export EUCALYPTUS_CERT=${NOVA_CERT} # euca-bundle-image seems to require this set
alias ec2-bundle-image="ec2-bundle-image --cert ${EC2_CERT} --privatekey ${EC2_PRIVATE_KEY} --user 42 --ec2cert ${NOVA_CERT}"
alias ec2-upload-bundle="ec2-upload-bundle -a ${EC2_ACCESS_KEY} -s ${EC2_SECRET_KEY} --url ${S3_URL} --ec2cert ${NOVA_CERT}"
export NOVA_API_KEY="XXXXXXXXXXXXXXXXXXXXXXXXXXX"
export NOVA_USERNAME="nova"
export NOVA_URL="http://130.199.148.53:8774/v1.0/"


Where "XXXX.." strings denote keys which I do not post here, for security.

The content of novarc file should now be added to bashrc:

cat /root/creds/novarc >> ~/.bashrc
source ~/.bashrc

This should be done both on compute and controller nodes.

Enable access to worker node

First unset a proxy and then do:

euca-authorize -P icmp -t -1:-1 default
euca-authorize -P tcp -p 22 default

How to configure worker node

2012-01-05T12:49:00.000-08:00

In the following I will describe how to configure the worker node. I assume that the worker node has been already installed following the instructions posted on this blog.

First of all, before we start, we still need to add nova-network (it has not been installed so far).

Do:

yum install openstack-nova-network

Once this is done, we can go on and edit the /etc/nova/nova.conf file.

First, add to the file the option

--daemonize=1

The relevant switches are:

--sql_connection
--s3_host
--rabbit_host
--ec2_api
--ec2_url
--fixed_range
--network_size

In the end the configuration file should look like:

--auth_driver=nova.auth.dbdriver.DbDriver
--buckets_path=/var/lib/nova/buckets
--ca_path=/var/lib/nova/CA
--cc_host=
--credentials_template=/usr/share/nova/novarc.template
--daemonize=1
--dhcpbridge_flagfile=/etc/nova/nova.conf
--dhcpbridge=/usr/bin/nova-dhcpbridge
--ec2_api=130.199.148.53
--ec2_url=http://130.199.148.53:8773/services/Cloud
--fixed_range=192.168.0.0/16
--glance_host=
--glance_port=9292
--image_service=nova.image.glance.GlanceImageService
--images_path=/var/lib/nova/images
--injected_network_template=/usr/share/nova/interfaces.rhel.template
--instances_path=/var/lib/nova/instances
--keys_path=/var/lib/nova/keys
--libvirt_type=kvm
--libvirt_xml_template=/usr/share/nova/libvirt.xml.template
--lock_path=/var/lib/nova/tmp
--logdir=/var/log/nova
--logging_context_format_string=%(asctime)s %(name)s: %(levelname)s [%(request_id)s %(user)s %(project)s] %(message)s
--logging_debug_format_suffix=
--logging_default_format_string=%(asctime)s %(name)s: %(message)s
--network_manager=nova.network.manager.VlanManager
--networks_path=/var/lib/nova/networks
--network_size=8
--node_availability_zone=nova
--rabbit_host=130.199.148.53
--routing_source_ip=130.199.148.53
--s3_host=130.199.148.53
--scheduler_driver=nova.scheduler.zone.ZoneScheduler
--sql_connection=mysql://{USER}:{PWD}@130.199.148.53/{DATABASE}
--state_path=/var/lib/nova
--use_cow_images=true
--use_ipv6=false
--use_s3=true
--use_syslog=false
--verbose=false
--vpn_client_template=/usr/share/nova/client.ovpn.template

where {USER}, {PWD} and {DATABASE} denote the nova database user, password and database name.
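Since the file is just a list of --name=value flags, it is easy to check that none of the relevant switches was forgotten. The following is a minimal sketch, not a nova tool; the names parse_flagfile and missing_flags are made up, and the list of required flags is simply the list of relevant switches given above.

```python
REQUIRED_FLAGS = ("sql_connection", "s3_host", "rabbit_host",
                  "ec2_api", "ec2_url", "fixed_range", "network_size")

def parse_flagfile(text):
    """Parse '--name=value' lines into a dict; other lines are ignored."""
    flags = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("--") and "=" in line:
            name, value = line[2:].split("=", 1)
            flags[name] = value
    return flags

def missing_flags(flags):
    """Return the required flags that are absent or left empty."""
    return [name for name in REQUIRED_FLAGS if not flags.get(name)]
```

Running missing_flags(parse_flagfile(open("/etc/nova/nova.conf").read())) on the worker node should return an empty list once the file is complete.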

Now go to the controller node and open the following ports for incoming connections: 3333,3306,5672,8773,8000.
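To verify from the worker node that those ports are really reachable, a plain TCP connect is enough. This is a sketch under the assumption that each service is already listening on the controller; the helper names are made up for illustration.

```python
import socket

# Ports the controller must accept connections on (from the list above).
CONTROLLER_PORTS = (3333, 3306, 5672, 8773, 8000)

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def closed_ports(host, ports=CONTROLLER_PORTS):
    """Return the ports on host that refused the connection or timed out."""
    return [p for p in ports if not port_is_open(host, p)]
```

For example, closed_ports("130.199.148.53") run on the worker node lists whatever is still blocked or not yet started.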

Go back to worker node and prepare /root/bin/openstack-init.sh script with the following content:

#!/bin/bash
# Pass the given action (start, stop, status, ...) to each nova service
for n in ajax-console-proxy compute vncproxy network; do
service openstack-nova-$n "$@";
done

Then run

/root/bin/openstack-init.sh stop
Stopping OpenStack Nova Web-based serial console proxy: [ OK ]
Stopping OpenStack Nova Compute Worker: [ OK ]
Stopping OpenStack Nova VNC Proxy: [ OK ]
Stopping OpenStack Nova Network Controller: [ OK ]
[root@gridreserve30 compute]# /root/bin/openstack-init.sh start
Starting OpenStack Nova Web-based serial console proxy: [ OK ]
Starting OpenStack Nova Compute Worker: [ OK ]
Starting OpenStack Nova VNC Proxy: [ OK ]
Starting OpenStack Nova Network Controller: [ OK ]

to be continued...

What's the hold-up?

2011-12-29T17:12:00.000-08:00

Do you have the following diagram memorized?

If your site runs Condor, you probably should. It shows the states of the condor_startd, the activities within the state, and the transitions between them. If you want to have jobs reliably pre-empted (or is that killed? Or vacated?) from the worker node for something like memory usage, a clear understanding is required.

However, the 30 state transitions might be a bit much for some site admins who just want to kill jobs that go over a memory limit. In such a case, admins can utilize the SYSTEM_PERIODIC_REMOVE or the SYSTEM_PERIODIC_HOLD configuration parameters on the condor_schedd to respectively remove or hold jobs.

These expressions periodically evaluate the schedd's copy of the job ClassAd (by default, once every 60s); if they evaluate to true for a given job, they will remove or hold it. This will almost immediately preempt execution on the worker node.

[Note: While effective and simple, these are not the best way to accomplish these sort of policies! As the worker node may talk to multiple schedd's (via flocking, or just through a complex pool with many schedd's), it's best to express the node's preferences locally.]

At HCC, the periodic hold and remove policy looks like this:

# hold jobs using absurd amounts of disk (100+ GB) or memory (1.6+ GB)
SYSTEM_PERIODIC_HOLD = \
   (JobStatus == 1 || JobStatus == 2) && ((DiskUsage > 100000000 || ResidentSetSize > 1600000))

# forceful removal of running after 2 days, held jobs after 6 hours,
# and anything trying to run more than 10 times
SYSTEM_PERIODIC_REMOVE = \
   (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*6) || \
   (JobStatus == 2 && CurrentTime - EnteredCurrentStatus > 3600*24*2) || \
   (JobStatus == 5 && JobRunCount >= 10) || \
   (JobStatus == 5 && HoldReasonCode =?= 14 && HoldReasonSubCode =?= 2)

We place anything on hold that goes over some pre-defined resource limit (disk usage or memory usage). Jobs are removed if they have been on hold for a long time, have run for too long, have restarted too many times, or are missing their input files.
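For readers who find ClassAd expressions hard to scan, here is the same policy restated as plain Python over a dictionary of job attributes. It is only an illustration of the logic, not how the schedd evaluates it: missing attributes default to 0 here, whereas in a real ClassAd they would be undefined, and the function names are made up.

```python
IDLE, RUNNING, HELD = 1, 2, 5  # JobStatus codes used in the policy above

def should_hold(ad):
    """Mirror SYSTEM_PERIODIC_HOLD: idle or running jobs over a limit."""
    active = ad.get("JobStatus") in (IDLE, RUNNING)
    over_disk = ad.get("DiskUsage", 0) > 100000000        # KB, about 100 GB
    over_memory = ad.get("ResidentSetSize", 0) > 1600000  # KB, about 1.6 GB
    return active and (over_disk or over_memory)

def should_remove(ad, now):
    """Mirror SYSTEM_PERIODIC_REMOVE for a job ad and a Unix timestamp."""
    status = ad.get("JobStatus")
    age = now - ad.get("EnteredCurrentStatus", now)
    return (
        (status == HELD and age > 3600 * 6)               # held over 6 hours
        or (status == RUNNING and age > 3600 * 24 * 2)    # running over 2 days
        or (status == HELD and ad.get("JobRunCount", 0) >= 10)
        or (status == HELD and ad.get("HoldReasonCode") == 14
            and ad.get("HoldReasonSubCode") == 2)
    )
```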

Note that this is a flat policy for the cluster - heterogeneous nodes with large amounts of RAM per core would not be well-utilized. We could tweak this by having users add the RequestMemory attribute to their job's ad (defaulting to 1.6GB), place into the Requirements that the slot have sufficient memory, and have the node only accept jobs that request memory below a certain threshold. The expression above could then be tweaked to hold jobs where (ResidentSetSize > RequestMemory). Perhaps more on that in the future if we go this route.
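One detail to watch with a per-job limit is units. The sketch below assumes RequestMemory is expressed in megabytes while ResidentSetSize is in kilobytes, so the comparison needs a conversion; the 1600 MB default and the function name are illustrative stand-ins for the 1.6GB default mentioned above.

```python
def exceeds_request(ad, default_request_mb=1600):
    """True if resident memory (KB) exceeds the requested memory (MB)."""
    request_kb = ad.get("RequestMemory", default_request_mb) * 1024
    return ad.get("ResidentSetSize", 0) > request_kb
```

With this rule, a job using 1700000 KB is held under the default request but left alone if it asked for 2048 MB.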

While the SYSTEM_PERIODIC_* expressions are useful, Dan Bradley recently introduced me to the SYSTEM_PERIODIC_*_REASON parameter. This allows you to build a custom hold message for the user whose jobs you're about to interrupt. The expression is evaluated within the context of the job's ad, and the resulting string is placed in the job's HOLD_REASON. As an example, previously, the hold message was something bland and generic:

The SYSTEM_PERIODIC_HOLD expression evaluated to true.

Why did it evaluate to true? Was it memory or disk usage? When it was held, how bad was the disk/memory usage? These things can get lost in the system. Oops. We added the following to our schedd's configuration:

# Report why the job went on hold.
SYSTEM_PERIODIC_HOLD_REASON = \
   strcat("Job in status ", JobStatus, \
   " put on hold by SYSTEM_PERIODIC_HOLD due to ", \
   ifThenElse(isUndefined(DiskUsage) || DiskUsage < 100000000, \
      strcat("memory usage ", ResidentSetSize), \
      strcat("disk usage ", DiskUsage)), ".")

Now, we have beautiful error messages in the user's logs explaining the issue:

Job in status 2 put on hold by SYSTEM_PERIODIC_HOLD due to memory usage 1620340.

One less thing to get confused about!
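To make the branching in the SYSTEM_PERIODIC_HOLD_REASON expression explicit, here is the same string built in plain Python from a dictionary of job attributes. This is an illustration only; the function name is made up, and a missing DiskUsage is treated like an undefined attribute.

```python
def hold_reason(ad):
    """Build the hold message the same way the strcat() expression does."""
    disk = ad.get("DiskUsage")
    if disk is None or disk < 100000000:
        # Disk is unknown or under the limit, so memory must be the culprit.
        cause = "memory usage %s" % ad.get("ResidentSetSize")
    else:
        cause = "disk usage %s" % disk
    return ("Job in status %s put on hold by SYSTEM_PERIODIC_HOLD due to %s."
            % (ad.get("JobStatus"), cause))
```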

A simple iRODS Micro-Service

2011-12-23T06:14:00.000-08:00

Introduction

The goal I had for this task was to identify and understand the steps and configurations involved in writing a micro-service and seeing it in action - for details regarding iRODS please refer to documentation at https://www.iRODS.org/. The micro-service that I wrote is very simplistic (it writes a hello world message to the system log), however it serves its purpose by providing an overview of steps that will be involved in writing a useful micro-service.

Before I document the configurations and codes involved in creating and registering the new micro-service let’s look at figure 1.

Figure 1 shows a high level view of invocation of a micro-service by the iRODS rules engine. One way of looking at the micro-service and the iRODS rule engine is to think of it as an event based triggering system that can perform ‘operations’ on the data objects, and/or external resources. The micro-services are registered in iRODS rule definitions and the rule engine invokes them based on the condition specified for that rule. For a list of places in the iRODS workflow where a micro-service may be triggered please visit: https://www.irods.org/index.php/Default_iRODS_Rules.

Also you may refer to https://www.iRODS.org/index.php/Rule_Engine for a detailed diagram of a micro-service invocation.

Figure 2 above shows the communication between the iRODS rule engine and a micro-service. A simplistic view of the communication layers is that the rule engine calls a defined C procedure, which exposes its functionality through an interface (commonly prefixed with msi). The arguments to the procedure are passed through a structure named msParam_t that is defined below:

typedef struct MsParam {
  char *label;
  char *type;         /* this is the name of the packing instruction in
                       * rodsPackTable.h */
  void *inOutStruct;
  bytesBuf_t *inpOutBuf;
} msParam_t;

Writing the micro-service

Figure 3 shows the steps involved in creating a new micro-service:

Write the C procedure

The C code below (let's call it test.c) has a function writemessage that writes a message to the system log. There is an interface to the function named msiWritemessage which exposes the writemessage function. The msi function takes a list of arguments of type msParam_t and a last argument of type ruleExecInfo_t for the result of the operation.

#include <stdio.h>
#include <unistd.h>
#include <syslog.h>
#include <string.h>
#include "apiHeaderAll.h"


void writemessage(char arg1[], char arg2[]);
int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,  ruleExecInfo_t *rei);


void writemessage(char arg1[], char arg2[]) {
    openlog("slog", LOG_PID|LOG_CONS, LOG_USER);
    syslog(LOG_INFO, "%s %s from micro-service", arg1, arg2);
    closelog();
}

int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,  ruleExecInfo_t *rei)
{
 /* Default to empty strings so writemessage never sees an
  * uninitialized pointer if an argument has an unexpected type. */
 char *in1 = "";
 char *in2 = "";
 RE_TEST_MACRO ("    Calling Procedure");
 // the above line is needed for loop back testing using irule -i option
 if ( strcmp( mParg1->type, STR_MS_T ) == 0 )
 {
    in1 = (char*) mParg1->inOutStruct;
 }
 // both rules in this post pass the second argument as a string
 if ( strcmp( mParg2->type, STR_MS_T ) == 0 )
 {
    in2 = (char*) mParg2->inOutStruct;
 }
 writemessage(in1, in2);
 return rei->status;
}

Next I will make a folder structure in the modules folder of the iRODS home for this micro-service, copy a few files from the example properties module, and modify them to fit the test.c micro-service:

cd ~irods
mkdir modules/HCC
cd modules/HCC

mkdir microservices
mkdir rules
mkdir lib
mkdir clients
mkdir servers

mkdir microservices/src
mkdir microservices/include
mkdir microservices/obj
cp ../properties/Makefile .
cp ../properties/info.txt .

Listed below is my working copy of Makefile and the info.txt

#Makefile
ifndef buildDir
buildDir = $(CURDIR)/../..
endif

include $(buildDir)/config/config.mk
include $(buildDir)/config/platform.mk
include $(buildDir)/config/directories.mk
include $(buildDir)/config/common.mk

#
# Directories
#
MSObjDir =    $(modulesDir)/HCC/microservices/obj
MSSrcDir =    $(modulesDir)/HCC/microservices/src
MSIncDir =    $(modulesDir)/HCC/microservices/include

# Source files

OBJECTS =    $(MSObjDir)/test.o


# Compile and link flags
#
INCLUDES +=    $(INCLUDE_FLAGS) $(LIB_INCLUDES) $(SVR_INCLUDES)
CFLAGS_OPTIONS := $(CFLAGS) $(MY_CFLAG)
CFLAGS =    $(CFLAGS_OPTIONS) $(INCLUDES) $(MODULE_CFLAGS)

.PHONY: all server client microservices clean
.PHONY: server_ldflags client_ldflags server_cflags client_cflags
.PHONY: print_cflags

# Build everything
all:    microservices
    @true

# List module's objects and needed libs for inclusion in clients
client_ldflags:
    @true

# List module's includes for inclusion in the clients
client_cflags:
    @true

# List module's objects and needed libs for inclusion in the server
server_ldflags:
    @echo $(OBJECTS) $(LIBS)

# List module's includes for inclusion in the server
server_cflags:
    @echo $(INCLUDE_FLAGS)

# Build microservices
microservices:    print_cflags $(OBJECTS)

# Build client additions
client:
    @true

# Build server additions
server:
    @true

# Build rules
rules:
    @true

# Clean
clean:
    @echo "Clean image module..."
    rm -rf $(MSObjDir)/*.o


# Show compile flags
print_cflags:
    @echo "Compile flags:"
    @echo "    $(CFLAGS_OPTIONS)"

# Compile targets
#
$(OBJECTS): $(MSObjDir)/%.o: $(MSSrcDir)/%.c $(DEPEND)
    @echo "Compile image module `basename $@`..."
    @$(CC) -c $(CFLAGS) -o $@ $<

info.txt

Name:        HCC
Brief:        HCC Test microservice
Description:    HCC Test microservice.
Dependencies:
Enabled:    yes
Creator:    Ashu Guru
Created:    December 2011
License:    BSD

In the next step I will define the micro-service header and micro-service table files so that iRODS can be configured with the new micro-service. This is done in the folder microservices/include. In this example there is no header for this code so I have left the header file blank; in the micro-service table file I have the entry for the table definition. The specifics to note below are that the first argument is the label of the micro-service, the second argument is the count of input arguments (do not count the ruleExecInfo_t argument) of the msi interface and the third argument is the name of the msi interface function.

File microservices/include/microservices.table

{ "msiWritemessage",2,(funcPtr) msiWritemessage },

Following is the directory tree structure for the HCC module that I have so far:

bash-4.1$ pwd

/opt/iRODS/modules
bash-4.1$ tree HCC
HCC
├── clients
├── info.txt
├── lib
├── Makefile
├── microservices
│   ├── include
│   │   ├── microservices.header
│   │   ├── microservices.table
│   ├── obj
│   └── src
│       ├── test.c
├── rules
└── servers

Next I will make an entry to enable the new module (this micro-service). This is done in the file ~irods/config/config.mk so that the iRODS Makefile can include the new micro-service in the build. To do this, simply add the module folder name (in my case HCC) to the variable MODULES.

Compile and test

cd ~irods/modules/<YOURMODULENAME>
make

The above commands should result in the creation of an object file in the microservices/obj folder. I am going to test the micro-service manually first; to accomplish this I will create a client-side rule file in the folder ~irods/clients/icommands/test/rules. I have named the file aguru.ir, and its contents are as follows:

aguruTest||msiWritemessage(*A,*B)|nop
*A=helloworld%*B=testing

The first line in the file is the rule definition and the second line holds the input parameters. To test the micro-service I will invoke the rule, which will then write a message to the system log (see figure below).
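The input line is a simple list of name=value pairs separated by %. As a rough illustration of that format (not iRODS code; it assumes no escaping is involved and the function name is made up), it can be split like this:

```python
def parse_rule_inputs(line):
    """Split '*A=helloworld%*B=testing' into {'*A': 'helloworld', ...}."""
    inputs = {}
    for pair in line.strip().split("%"):
        if pair:
            name, value = pair.split("=", 1)
            inputs[name] = value
    return inputs
```

The two names correspond to the *A and *B placeholders passed to msiWritemessage on the first line of the rule file.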

Recompile iRODS

Before this step I must make the entries for the headers and the msi table in the iRODS main micro-service action table (i.e. file ~irods/server/re/include/reAction.h). This should be done using the following commands:

rm server/re/include/reAction.h
make reaction

However, I had to manually add the code segment below to the file server/re/include/reAction.h file to accomplish that:

int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,  ruleExecInfo_t *rei);

Finally, recompile iRODS

cd ~irods
make test_flags
make modules
./irodsctl stop
make clean
make
./irodsctl start
./irodsctl status

Register Micro-service and Test

In this step we define a rule that will trigger the micro-service when a new data object is uploaded to iRODS. Open the file ~irods/server/config/reConfigs/core.re and add the following line to the Test Rules section.

acPostProcForPut {msiWritemessage("HelloWorld","String 2"); }

That is it… if I now put (iput) any file into iRODS, a message is added to the /var/log/messages file on the iRODS server. Please note that the above rule does not filter for a particular occurrence; it is a catch-all rule that applies to all put events.

References:
https://www.irods.org/
http://www.wrg.york.ac.uk/iread/compiling-and-running-irods-with-micros-services
http://technical.bestgrid.org/index.php/IRODS_deployment_plan

How to create openstack controller

2011-12-15T11:47:00.001-08:00

As before, the "official" instructions on which our procedure is based are here:

http://docs.openstack.org/cactus/openstack-compute/admin/content/installing-openstack-compute-on-rhel6.html

First setup the repository:

wget http://yum.griddynamics.net/yum/cactus/openstack/openstack-repo-2011.2-1.el6.noarch.rpm
rpm -ivh openstack-repo-2011.2-1.el6.noarch.rpm

Then install openstack and dependencies

yum install libvirt
chkconfig libvirtd on
/etc/init.d/libvirtd start
yum install euca2ools openstack-nova-{api,compute,network,objectstore,scheduler,volume} openstack-nova-cc-config openstack-glance

Start services:

service mysqld start
chkconfig mysqld on
service rabbitmq-server start
chkconfig rabbitmq-server on

Setup database authorisations. First set up root password:

mysqladmin -uroot password

Now, to automate the procedure create an executable shell script

openstack-db-setup.sh

with the following content (fill the relevant user name and password fields as well as the IP's):

#!/bin/bash

DB_NAME=nova
DB_USER=
DB_PASS=
# MySQL root password; not named PWD, which bash uses for the working directory
ROOT_PASS=

#CC_HOST="A.B.C.D" # IPv4 address
CC_HOST="130.199.148.53" # IPv4 address, fill your own
#HOSTS='node1 node2 node3' # compute nodes list
HOSTS='130.199.148.54' # compute nodes list, fill your own

# Recreate the database from scratch
mysqladmin -uroot -p"$ROOT_PASS" -f drop "$DB_NAME"
mysqladmin -uroot -p"$ROOT_PASS" create "$DB_NAME"

# Grant access from every compute node, from localhost, and from anywhere
for h in $HOSTS localhost; do
echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO '$DB_USER'@'$h' IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p"$ROOT_PASS" mysql
done
echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO $DB_USER IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p"$ROOT_PASS" mysql
echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO root IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p"$ROOT_PASS" mysql

And now execute this script:

./openstack-db-setup.sh

Create db schema

nova-manage db sync

Now comes a point which is not in the "official" instructions. The installation will not work unless you patch your python:

patch -p0 < rhel6-nova-network-patch.diff

Create logical volumes:

lvcreate -L 1G --name test nova-volumes

For your convenience create an openstack startup shell script openstack-init.sh

Here is its content:

#!/bin/bash
# Pass the given action (start, stop, status, ...) to each service
for n in api compute network objectstore scheduler volume; do
service openstack-nova-$n "$@";
done
service openstack-glance-api "$@"

And finally we are ready to start openstack:

openstack-init.sh start

With fingers crossed you should get

Starting OpenStack Nova API Server: [ OK ]
Starting OpenStack Nova Compute Worker: [ OK ]
Starting OpenStack Nova Network Controller: [ OK ]
Starting OpenStack Nova Object Storage: [ OK ]
Starting OpenStack Nova Scheduler: [ OK ]
Starting OpenStack Nova Volume Worker: [ OK ]
Starting OpenStack Glance API Server: [ OK ]

Now we need to configure and customize the installation which is another story for another day...

How to create openstack worker node

2011-12-15T11:40:00.001-08:00

The "official" instructions on how to install openstack components are located here:

http://docs.openstack.org/cactus/openstack-compute/admin/content/installing-openstack-compute-on-rhel6.html

Unfortunately they are not very clear and miss some key points. Below is a summary of our installation procedure.

First of all, let us install the worker node.

wget http://yum.griddynamics.net/yum/cactus/openstack/openstack-repo-2011.2-1.el6.noarch.rpm
rpm -ivh openstack-repo-2011.2-1.el6.noarch.rpm
yum install libvirt
chkconfig libvirtd on
/etc/init.d/libvirtd start
yum install openstack-nova-compute openstack-nova-compute-config
service openstack-nova-compute start

If everything goes fine you should see

Starting OpenStack Nova Compute Worker: [ OK ]

Network Accounting for Condor

2011-12-08T17:03:00.000-08:00

It's been a long time since the August post describing how to set up manual network accounting for a process. We now have a solution integrated into Condor and available on github. It takes a bit of explanation to understand how it works, so I've put together a series of diagrams to illustrate it.

First, we start off with the lowly condor_starter on any worker node with a network connection (to simplify things, I didn't draw the other condor processes involved):

By default, all processes on the node are in the same network namespace (labelled the "System Network Namespace" in this diagram). We denote the network interface with a box, and assume it has address 192.168.0.1.

Next, the starter will create a pair of virtual ethernet devices. We will refer to them as pipe devices, because any byte written into one will come out of the other - just like a venerable Unix pipe:

By default, the network pipes are in a down state and have no IP address associated with them. Not very useful! At this point, we have some decisions to make: how should the network pipe device be presented to the network? Should it be networked at layer 3, using NAT to route packets? Or should we bridge it at layer 2, allowing the device to have a public IP address?

Really, it's up to the site, but we assume most sites will want to take the NAT approach: the public IP address might seem useful, but would require a public IP for each job. To allow customization, all the routing is done by a helper script, and we provide a default implementation for NAT. The script:

Takes two arguments, a unique "job identifier" and the name of the network pipe device.
Is responsible for setting up any routing required for the device.
Must create an iptables chain with the same name as the "job identifier".

Each rule in the chain will record the number of bytes matched; at the end of the job, these will be reported in the job ClassAd using an attribute name identical to the comment on the rule.

On stdout, returns the IP address the internal network pipe should use.

Additionally, Condor provides a cleanup script that does the inverse of the setup script. The result looks something like this:

Next, the starter forks a separate process in a new network namespace using the clone() call with the CLONE_NEWNET flag. Notice that, by default, no network devices are accessible in the new namespace:

Next, the external starter will pass one side of the pipe to the other namespace; the internal starter will do some minimal configuration of the device (default route, IP address, set the device to the "up" status):

Finally, the starter exec's to the job. Whenever the job does any network operations, the bytes are routed via the internal network pipe, come out the external network pipe, and then are NAT'd to the physical network device before exiting the machine.

As mentioned, the whole point of the exercise is to do network accounting. Since all packets go through one device, Condor can read out all the activity via iptables. The "helper script" above will create a unique chain per job. This allows some level of flexibility; for example, the chain below allows us to distinguish between on-campus and off-campus packets:

Chain JOB_12345 (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 ACCEPT     all  --  veth0  em1     anywhere             129.93.0.0/16        /* OutgoingInternal */
    0     0 ACCEPT     all  --  veth0  em1     anywhere            !129.93.0.0/16        /* OutgoingExternal */
    0     0 ACCEPT     all  --  em1    veth0   129.93.0.0/16        anywhere             state RELATED,ESTABLISHED /* IncomingInternal */
    0     0 ACCEPT     all  --  em1    veth0  !129.93.0.0/16        anywhere             state RELATED,ESTABLISHED /* IncomingExternal */
    0     0 REJECT     all  --  any    any     anywhere             anywhere             reject-with icmp-port-unreachable

Thus, the resulting ClassAd history from this job will have an attribute for NetworkOutgoingInternal, NetworkOutgoingExternal, NetworkIncomingInternal, and NetworkIncomingExternal. We have an updated Condor Gratia probe that looks for Network* attributes and reports them appropriately to the accounting database.
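To make the mapping from rule comments to attributes concrete, here is a minimal Python sketch. It is not the code in the patch (which reads iptables from C); the function name, the regular expression, and the assumption that each attribute is the rule comment with a "Network" prefix are mine, and the byte counts in the sample are made up for illustration.

```python
import re

# Matches a rule row from verbose iptables output: packet count, byte count,
# then a trailing /* comment */ that names the accounting bucket.
RULE_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+.*/\*\s*(\S+)\s*\*/\s*$")


def chain_to_attributes(chain_text, prefix="Network"):
    """Map each commented rule to a '<prefix><comment>' -> bytes entry.

    The "Network" prefix is an assumption based on the attribute names
    mentioned in the post.
    """
    attrs = {}
    for line in chain_text.splitlines():
        match = RULE_RE.match(line)
        if match:
            _pkts, nbytes, comment = match.groups()
            attrs[prefix + comment] = int(nbytes)
    return attrs


# Illustrative chain listing; the counters are invented for this example.
SAMPLE = """\
Chain JOB_12345 (2 references)
 pkts bytes target     prot opt in     out     source               destination
   30  1570 ACCEPT     all  --  veth0  em1     anywhere             129.93.0.0/16        /* OutgoingInternal */
   18  1025 ACCEPT     all  --  veth0  em1     anywhere            !129.93.0.0/16        /* OutgoingExternal */
   28 26759 ACCEPT     all  --  em1    veth0   129.93.0.0/16        anywhere             state RELATED,ESTABLISHED /* IncomingInternal */
   17 10573 ACCEPT     all  --  em1    veth0  !129.93.0.0/16        anywhere             state RELATED,ESTABLISHED /* IncomingExternal */
    0     0 REJECT     all  --  any    any     anywhere             anywhere             reject-with icmp-port-unreachable
"""

if __name__ == "__main__":
    for name, value in sorted(chain_to_attributes(SAMPLE).items()):
        print(name, "=", value)
```

Rules without a comment (such as the final REJECT) are skipped, so only the buckets the helper script chose to label end up in the job ad.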

Thus, we have byte-level network accounting, allowing us to answer the age-old question of "how much would a CMS T2 cost on Amazon EC2?". Or perhaps we could answer "how much is a currently running job going to cost me?" Matt has pointed out the network setup callout could be used to implement security zones, isolating (or QoS'ing) jobs of certain users at the network level. There are quite a few possibilities!

We'll definitely be returning to this work mid-2012 when the local T2 is based on SL6, and this patch can be put into production. There will be some further engagement with the Condor team to see if they're interested in taking the patch. The Gratia probe work to manage network information will be interesting to upstream too. Finally, I encourage interested readers to take a look at the github branch. The patch itself is a tour-de-force of several dark corners of Linux systems programming (it involves using clone, synchronizing processes with pipes, sending messages to the kernel via netlink to configure the routing, and reading out iptables configurations using C). It was very rewarding to implement!

Details on glexec improvements

2011-12-01T17:46:00.000-08:00

My last blog post gave a quick overview of why glexec exists, what issues folks run into, and what we did to improve it. Let's go into some details.

How Condor Update Works
The lcmaps-plugin-condor-update package contains the modules necessary to advertise the payload certificate of the last glexec invocation in the pilot's ClassAd. The concept is simple - the implementation is a bit tricky.

Condor has long had a command-line tool called condor_advertise; it allows an admin to hand-advertise updates to ads in the collector. Unfortunately, that's not quite what we need here: we want to update the job ad in the schedd, while condor_advertise typically updates the machine ad in the collector. Close, but no cigar.

There's a lesser-known utility called condor_chirp that we can use. Typically, condor_chirp is used to do I/O between the schedd and the starter (for example, you can pull/push files on demand in the middle of the job), but it can also update the job's ad in the schedd. The syntax is simple:

condor_chirp set_job_attr ATTR_NAME ATTR_VAL

(look at the clever things Matt does with condor_chirp). As condor_chirp allows additional access to the schedd, the user must explicitly request it in the job ad. If you want to try it out, you must add the following line into your submit file:

+WantIOProxy=TRUE

To work, chirp must know how to contact the starter and have access to the "magic cookie"; these are located inside the $_CONDOR_SCRATCH_DIR, as set by Condor in the initial batch process. As the glexec plugin runs as root (glexec must be setuid root to launch a process as a different UID), we must guard against being fooled by the invoking user.
Accordingly, the plugin uses /proc to read the parentage of the process tree until it finds a process owned by root. If this is not init, it is assumed the process is the condor_starter, and the job's $_CONDOR_SCRATCH_DIR can be deduced from the $CWD and the PID of the starter. Since we only rely on information from root-owned processes, we can be fairly sure this is the correct scratch directory. As a further safeguard, before invoking condor_chirp, the plugin drops privilege to that of the invoking user. Along with the other security guarantees provided by glexec, we have confidence that we are reading the correct chirp configuration and are not allowing the invoker to increase its privileges.
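The ancestry walk described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's code (the plugin is not written in Python); the function names and the injectable lookup argument are my own, added so the walk can be exercised without touching /proc.

```python
import os


def proc_lookup(pid):
    """Return (parent_pid, real_uid) for pid by reading /proc/<pid>/status."""
    ppid = uid = None
    with open("/proc/%d/status" % pid) as status:
        for line in status:
            key, _, value = line.partition(":")
            if key == "PPid":
                ppid = int(value.split()[0])
            elif key == "Uid":
                uid = int(value.split()[0])  # first field is the real UID
    return ppid, uid


def find_root_ancestor(pid, lookup=proc_lookup):
    """Walk up the process tree; return the first root-owned ancestor's PID.

    Returns None if the walk reaches init (PID 1) first, since in that case
    no root-owned, starter-like process sits between the caller and init.
    """
    ppid, _ = lookup(pid)
    while ppid and ppid > 1:
        grandparent, uid = lookup(ppid)
        if uid == 0:
            return ppid
        ppid = grandparent
    return None


if __name__ == "__main__":
    print(find_root_ancestor(os.getpid()))
```

Because only the UID of each ancestor is consulted, nothing the invoking user controls (environment variables, arguments) influences which process is picked.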

Once we know how to invoke condor_chirp, the rest of the process is all downhill. glexec internally knows the payload's DN, the payload Unix user, and does the equivalent of the following:

condor_chirp set_job_attr glexec_user "hcc"
condor_chirp set_job_attr glexec_x509userproxysubject "/DC=org/DC=cilogon/C=US/O=University of Nebraska-Lincoln/CN=Brian Bockelman A621"
condor_chirp set_job_attr glexec_time 1322761868

condor_chirp writes the data into the starter, which then updates the shadow, then the schedd (some of the gory details are covered in the Condor wiki).
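For readers who want to script the same three updates, here is a hedged Python sketch that only builds the argument lists; it does not run anything. The helper names are hypothetical, and I am assuming string values should be passed as double-quoted ClassAd string literals while the timestamp is passed as a bare integer.

```python
def classad_string(value):
    """Quote a Python string as a double-quoted ClassAd string literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def chirp_updates(unix_user, payload_dn, invoked_at):
    """Build condor_chirp argument lists for the three glexec attributes."""
    base = ["condor_chirp", "set_job_attr"]
    return [
        base + ["glexec_user", classad_string(unix_user)],
        base + ["glexec_x509userproxysubject", classad_string(payload_dn)],
        base + ["glexec_time", str(int(invoked_at))],
    ]


if __name__ == "__main__":
    dn = "/DC=org/DC=cilogon/C=US/O=University of Nebraska-Lincoln/CN=Brian Bockelman A621"
    for argv in chirp_updates("hcc", dn, 1322761868):
        print(argv)
```

Each list could be handed to subprocess.call from inside a running job, provided the job was submitted with +WantIOProxy=TRUE so that chirp is allowed to talk to the starter.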

The diagram below illustrates the data flow:

Putting this into Play
If you really want to get messy, you can check out the source code from Subversion at:

svn://t2.unl.edu/brian/lcmaps-plugins-condor-update

(web view)

The current version of the plugin is 0.0.2. It's available in Koji, or via yum in the osg-development repository:

yum install --enablerepo=osg-development lcmaps-plugins-condor-update

(you must already have the osg-release RPM installed and glexec otherwise configured).

After installing it, you need to update the /etc/lcmaps.db configuration file on the worker node to invoke the condor-update module. In the top half, I add:

condor_updates = "lcmaps_condor_update.mod"

Then, I add condor-update to the glexec policy:

glexec:

verifyproxy -> gumsclient
gumsclient -> condor_updates
condor_updates -> tracking

Note we use the "tracking" module locally; most sites will use the "glexec-tracking" module. Pick the appropriate one.

Finally, you need to turn on the I/O proxy in the Condor submit file. We do this by editing condor.pm (for RPMs, located in /usr/lib/perl5/vendor_perl/5.8.8/Globus/GRAM/JobManager/condor.pm). We add the following line into the submit routine, right before queue is added to the script file:

print SCRIPT_FILE "+WantIOProxy=TRUE\n";

All new incoming jobs will get this attribute; any glexec invocations they do will be reflected at the CE!

GUMS and Worker Node Certificates

To map a certificate to a Unix user, glexec calls out to the GUMS server using XACML with a grid-interoperable profile. In the XACML callout, GUMS is given the payload's DN and VOMS attributes. The same library (LCMAPS/SCAS-client) and protocol can also make callouts directly to SCAS, more commonly used in Europe.

GUMS is a powerful and flexible authorization tool; one feature is that it allows different mappings based on the originating hostname. For example, if desired, my certificate could map to user hcc at red.unl.edu but map to cmsprod at ff-grid.unl.edu. To prevent "just anyone" from probing the GUMS server, GUMS requires the client to present an X509 certificate (in this case, the hostcert); it takes the hostname from the client's certificate.

This has the unfortunate side-effect of requiring a host certificate on every node that invokes GUMS; OK for the CE (100 in the OSG), but not for glexec on the worker nodes (thousands on the OSG).

When glexec is invoked in EGI, SCAS is invoked using the pilot certificate for HTTPS and information about the payload certificate in the XACML callout; this requires no worker node host certificate.

To replicate how glexec works in EGI, we had to develop a small patch to GUMS. When the pilot certificate is used for authentication, the pilot's DN is recorded to the logs (so we know who is invoking GUMS), but the host name is self-reported in the XACML callout. As the authentication is still performed, we believe this relaxing of the security model is acceptable.

A patched, working version of GUMS can be found in Koji and is available in the osg-development repository. It will still be a few months before the RPM-based GUMS install is fully documented and released, however.

Once installed, two changes need to be made at the server:

Do all hostname mappings based on "DN" in the web interface, not the "CN".
Any group of users (for example, /cms/Role=pilot) that want to invoke GUMS must have "read all" access, not just "read self".

Further, /etc/lcmaps.db needs to be changed to remove the following lines from the gumsclient module:

"-cert   /etc/grid-security/hostcert.pem"
"-key    /etc/grid-security/hostkey.pem"
"--cert-owner root"

This will all be automated going forward, but in the meantime these changes should help remove some of the pain in deploying glexec!

Improving the glexec-enabled life

2011-11-11T08:32:00.000-08:00

Pilot-based workflow management systems have dramatically transformed how we view the grid today. Instead of queueing a job (the "payload") in a workflow onto a site on a grid, these systems send an "empty" job that starts up, then downloads and starts the payload from a central endpoint. In CS terms, it switches from a model of "work delegation" to "resource allocation". By allocating the resource (i.e., starting the pilot job) prior to delegating work, users no longer have to know the vagaries/failure modes of direct grid submission and don't have to pay the price of sending their payloads to a busy site!

In short, pilot jobs make the grid much better.

However, like most concepts, pilot jobs are a trade-off: they make life easier for users, but harder for security folks and sysadmins. Pilots are sent using one certificate, but payloads are run under a different identity. If the payload job wants to act on behalf of the user, it needs to bring the user's grid credentials to the worker node. [Side note: this is actually an interesting assumption. The PanDA pilot system, heavily utilized by ATLAS, does not bring credentials to the worker node. This simplifies this problem, but opens up a different set of concerns.] If both pilot and payload are run as the same Unix user, the payload user can easily access the credentials (including the pilot credentials), executables, and output data of other running payloads.

The program glexec is a "simple" idea to solve this problem: given a set of grid credentials, launch a process under the corresponding Unix account at the site. For example, with credentials from the HCC VO:

[bbockelm@brian-test ~]$ whoami
bbockelm
[bbockelm@brian-test ~]$ GLEXEC_CLIENT_CERT=/tmp/x509up_u1221 /usr/sbin/glexec /usr/bin/whoami
hcc

(You'll notice the invocation is not as simple as typing "glexec whoami"; it's not exactly designed for end-user invocation). To achieve the user switching, glexec has to be setuid root. Setuid binaries must be examined under a security microscope, which has unfortunately led to a slow adoption of glexec.

The idea is that pilot jobs would wrap the payload with a call to "glexec", separating the payload from the pilot and other payloads. From there, it goes horribly wrong. Not wrong really - but rather things get sticky.

Since the pilot and payload are both low-privileged users, the pilot doesn't have permission to clean up or kill the payload. It must again use glexec to send signals and delete sandboxes. The several invocations are easy to screw up (and place load on the authorization system!). There are tricky error conditions - if authorization breaks in the middle of the job, how does the pilot clean up the payload?

As the payload is a full-fledged Linux process, it can create other processes, daemonize, escape from the batch system, etc. As previously discussed, the batch system - with root access - typically does a poor job tracking processes. The pilot will be hopeless unless we provide some assistance.

Glexec imposes an integration difficulty at some sites. There are popular cron scripts that kill processes belonging to users on a node that aren't currently running batch system jobs. So, if the pilot maps to "cms" and the payload maps to "cmsuser", the batch system only knows about "cms", and the cronjob will kill all processes belonging to "cmsuser". We lost quite a few jobs at some sites before we figured this out!

Site admins manage the cluster via the batch system. Since the payload is invisible to the batch system, we're unable to kill jobs from a user with batch system tools (condor_rm, qdel). In fact, if we get an email from a user asking for help understanding their jobs, we can't even easily find where the job is running! Site admins have to ssh into each worker node and examine the running jobs; a process that is simply medieval.

Finally, on the OSG, invoking the authorization system requires host certificate credentials. This is not a problem when host certs are needed for a handful of CEs at the site, but explodes when glexec is run on each worker node. This is a piece of unique state on the worker nodes for sites to manage, adding to the glexec headache.

We're the Government. We're here to help.

The OSG Technology group has decided to tackle the three biggest site-admin usability issues in glexec:

Batch system integration: The Condor batch system provides the ability for running jobs to update the submit node with arbitrary status. We have developed a plugin that updates the job's ClassAd with the payload's DN whenever glexec is invoked.
Process tracking: There is an existing glexec plugin to do process tracking. However, this requires an admin to set up secondary GID ranges (an administration headache) and suffers the previously-documented process tracking issues. We will port the ProcPolice daemon over to the glexec plugin framework.
Worker node certificates: We propose to fix this via improvements to GUMS, allowing the mappings to be performed based on the presence of the "Role=pilot" VOMS extension in the pilot certificate.

The plugins in (1) and (2) have been prototyped, and are available in the osg-development repository as "lcmaps-plugins-condor-update" and "lcmaps-plugins-process-tracking", respectively. The third item is currently cooking.

The "lcmaps-plugins-condor-update" is especially useful, as it's a brand-new capability as opposed to an improvement. It advertises three attributes in the job's ClassAd:

glexec_x509userproxysubject: The DN of the payload user.
glexec_user: The Unix username for the payload.
glexec_time: The Unix time when glexec was invoked.

We can then use it to filter and locate jobs. For example, if a user named Ian complains his jobs are running slowly, we could locate a few with the following command:

[bbockelm@t3-sl5 ~]$ condor_q -g -const 'regexp("Ian", glexec_x509userproxysubject)' -format '%s ' ClusterId -format '%s\n' RemoteHost | head
868341 slot6@red-d11n10.red.hcc.unl.edu
868343 slot7@node238.red.hcc.unl.edu
868358 slot6@red-d11n9.red.hcc.unl.edu
868366 slot2@node239.red.hcc.unl.edu
868373 slot3@node119.red.hcc.unl.edu
868741 slot8@red-d9n6.red.hcc.unl.edu
868770 slot3@red-d9n8.red.hcc.unl.edu
868819 slot5@node109.red.hcc.unl.edu
868820 slot4@node246.red.hcc.unl.edu
868849 slot2@red-d11n6.red.hcc.unl.edu

Slick!

KVM and Condor (Part 2): Condor configuration for VM Universe & VM Image Staging

2011-10-19T07:14:00.000-07:00

This is Part 2 of the series that began with KVM and Condor (Part 1): Creating the virtual machine. In this post I will share the steps for configuring the Condor VM universe and discuss the steps involved in staging the VM disk images. It is assumed that you have a basic Condor setup working and a shared file system that is accessible from each of the worker nodes.

As a first step, please make sure that the worker nodes support KVM-based virtualization; if they do not, you may install it with:

yum groupinstall "KVM"

and yum -y install kvm libvirt libvirt-python python-virtinst libvirt-client

Configuring Condor for KVM

For Condor to support the VM universe, the following attributes must be set in the Condor configuration of each of the worker nodes (this may be done by modifying the local Condor config file):

VM_GAHP_SERVER = $(SBIN)/condor_vm-gahp
VM_GAHP_LOG = $(LOG)/VMGahpLog
VM_MEMORY = 5000
VM_TYPE = kvm
VM_NETWORKING = true
VM_NETWORKING_TYPE = nat
ENABLE_URL_TRANSFERS = TRUE
FILETRANSFER_PLUGINS = /usr/local/bin/vm-nfs-plugin

The explanation of the above attributes follows:

Attribute	Description
VM_GAHP_SERVER	The complete path and file name of the condor_vm-gahp.
VM_GAHP_LOG	The complete path and file name of the condor_vm-gahp log.
VM_MEMORY	A VM universe job is required to specify the memory needs for the disk image with vm_memory (Mbytes) in its job description file. On the worker node the value of the VM_MEMORY configuration is used for matching the memory requested by the job. VM_MEMORY is an integer value that specifies the maximum amount of memory in Mbytes that will be allowed for the virtual machine program.
VM_TYPE	This attribute can have the values kvm, xen, or vmware, and specifies the type of supported virtual machine software.
VM_NETWORKING	Must be set to true to support networking in the VM instances.
VM_NETWORKING_TYPE	This is a string value describing the type of networking.
ENABLE_URL_TRANSFERS	A Boolean value that, when True, causes the condor_starter for a job to invoke all plug-ins defined by FILETRANSFER_PLUGINS when a file transfer is specified with a URL in the job description file.
FILETRANSFER_PLUGINS	Is a comma separated list of absolute paths of executable(s) for plug-ins that will accomplish the task of file transfer when a job requests the transfer of an input file by specifying a URL.

The File Transfer Plugin

So far we have modified the configuration of the Condor worker nodes to support the Condor VM universe. Next I will describe a barebones FILETRANSFER_PLUGINS executable. I will use bash for scripting, and the plugin will reside at /usr/local/bin/vm-nfs-plugin on each of the worker nodes.

#!/bin/bash
#file: /usr/local/bin/vm-nfs-plugin
#----------------------------------------
# Plugin Essential
if [ "$1" = "-classad" ]
then
   echo "PluginVersion = \"0.1\""
   echo "PluginType = \"FileTransfer\""
   echo "SupportedMethods = \"nfs\""
   exit 0
fi

#----------------------------------------
# Variable definitions
# transferInputstr_format='nfs:<abs path to (nfs hosted) inputfile file>:<basename of vminstance file>'
WHICHQEMUIMG='/usr/bin/qemu-img'
initdir=$PWD
transferInputstr=$1
#-------------------------------------------
# Split the first argument to an array
IFS=':' read -ra transferInputarray <<< "$transferInputstr"
#-------------------------------------------
#create the vm instance copy on write
"$WHICHQEMUIMG" create -b "${transferInputarray[1]}" -f qcow2 "${initdir}/${transferInputarray[2]}"
# Propagate qemu-img's exit status so a failed create is reported as a failed transfer
exit $?

Overall the idea behind the above script is to create a qcow2 formatted VM instance file in the condor allocated execute folder. The details of code blocks above are listed below:

The “# Plugin Essential” part of the code is a requirement for a Condor file transfer plug-in, so that the plug-in can be registered appropriately to handle file transfers based on the methods (protocols) it supports. The condor_starter daemon invokes each plug-in with the command line argument ‘-classad’ to identify the protocols that the plug-in supports; it expects the plug-in to respond with an output of three ClassAd attributes. The first two are fixed: PluginVersion = "0.1" and PluginType = "FileTransfer"; the third is the ClassAd attribute ‘SupportedMethods’, a string value containing a comma separated list of the protocols that the plug-in handles. Thus, in the script above SupportedMethods = "nfs" identifies that the plug-in vm-nfs-plugin supports a user defined protocol ‘nfs’. Accordingly, the ‘nfs’ string will be matched to the protocol specification given within a URL in the transfer_input_files command in a Condor job description file.

For a file transfer invocation a plug-in is invoked with two arguments: the first is the URL specified in the job description file, and the second is the absolute path identifying where to place the transferred file. The plug-in is expected to transfer the file and exit with a status of 0 when the transfer is successful. A non-zero status must be returned when the transfer is unsuccessful. For an unsuccessful transfer, the job is placed on hold, the job ClassAd attribute HoldReason is set with a message, and HoldReasonSubCode is set to the exit status of the plug-in.
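The calling convention described in the last two paragraphs can be summarized with a small Python sketch. This is not the plug-in used in this post (that one is the bash script above); plugin_main and its transfer callback are hypothetical names, and the use of exit status 1 for every failure is simply my choice for the illustration.

```python
import sys


def plugin_main(argv, transfer):
    """Handle one invocation of a (hypothetical) file transfer plug-in.

    argv is the argument list without the program name; transfer is a
    callable (url, destination) that raises an exception on failure.
    Returns the exit status the plug-in should report to condor_starter.
    """
    if argv == ["-classad"]:
        # Registration query: advertise which URL methods we handle.
        print('PluginVersion = "0.1"')
        print('PluginType = "FileTransfer"')
        print('SupportedMethods = "nfs"')
        return 0
    if len(argv) != 2:
        return 1
    url, destination = argv
    try:
        transfer(url, destination)
    except Exception as err:
        sys.stderr.write("transfer failed: %s\n" % err)
        return 1  # non-zero status: the job is placed on hold
    return 0
```

A real plug-in would end with something like sys.exit(plugin_main(sys.argv[1:], my_transfer)), where my_transfer does the actual copy.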

In the bash code above I am only using the first argument that is received by the plugin. Further, I have decided that the value of transfer_input_files will follow the format given in the script comment transferInputstr_format, i.e. 'nfs:<abs path to (nfs hosted) inputfile file>:<basename of vminstance file>'. Thus, after splitting the first argument received by the plugin, the plug-in creates a qcow2 image with a backing file based on the original template.
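The splitting step is easy to get wrong, so here is the same logic as a short Python sketch. The function names and the example paths are hypothetical; the qemu-img arguments mirror the ones used in the bash script, and the sketch only builds the command rather than running it.

```python
import posixpath


def parse_transfer_input(url):
    """Split 'nfs:<backing file path>:<instance basename>' into its parts."""
    parts = url.split(":")
    if len(parts) != 3 or parts[0] != "nfs":
        raise ValueError("expected nfs:<path>:<basename>, got %r" % url)
    _method, backing_file, instance_name = parts
    return backing_file, instance_name


def qemu_img_command(url, workdir, qemu_img="/usr/bin/qemu-img"):
    """Build the qemu-img argument list the bash plug-in would execute."""
    backing_file, instance_name = parse_transfer_input(url)
    return [qemu_img, "create", "-b", backing_file, "-f", "qcow2",
            posixpath.join(workdir, instance_name)]


if __name__ == "__main__":
    # Hypothetical image path and execute directory.
    print(qemu_img_command("nfs:/nfs/images/base.img:vmimage.img", "/tmp/execute"))
```

Note that a path containing a colon would break this format, which is one reason to treat the three-field convention as a local agreement rather than a general URL scheme.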

Now once we send a condor reconfig using condor_reconfig to the worker node or restart condor service (service condor restart) on the worker nodes the plug-in is ready to be used; an example submit file is shown below.

Example Job Description

#Condor job description file
universe=vm
vm_type=kvm
executable=agurutest_vm
vm_networking=true
vm_no_output_vm=true
vm_memory=1536
#Point to the nfs location that will be available from worker node
transfer_input_files=nfs://<path to the vm image>:vmimage.img
vm_disk="vmimage.img:hda:rw"
requirements= (TARGET.FileSystemDomain =!= FALSE) && ( TARGET.VM_Type == "kvm" ) && ( TARGET.VM_AvailNum > 0 ) && ( VM_Memory >= 0 ) 
log=test.log
queue 1

This submit file should invoke the vm-nfs-plugin, and a VM instance should start on a worker node. You can test the VM by opening a shell on the worker node and using the virsh utility.

That is all for this post. In Part 3, the last part of this series, I will write about using the file transfer plugin with Storage Resource Manager (SRM).

Per-Batch Job Network Statistics

2011-09-08T09:00:00.000-07:00

Introduction

The OSG takes a fairly abstract definition of a cloud:

A cloud is a service that provisions resources on-demand for a marginal cost

The two important pieces of this definition are "resource provisioning" and "marginal cost". The most common cloud instance you'll run into is Amazon EC2, which provisions VMs; depending on the size of VM, the marginal cost is between $0.03 and $0.80 an hour.

The EC2 charge model is actually more complicated than just VMs-per-hour. There are additional charges for storage and network use. In controlled experiments last year, CMS determined the largest cost of using EC2 was not the CPU time, but the network usage.

This showed a glaring hole in OSG's current accounting: we only record wall and CPU time. For the other metrics - which can't be estimated accurately by looking at wall time - we are blind.

Long story short - if OSG ever wants to provide a cloud service using our batch systems, we need better accounting.

Hence, we are running a technology investigation to bring batch system accounting up to par with EC2's: https://jira.opensciencegrid.org/browse/TECHNOLOGY-2

Our current target is to provide a proof-of-concept using Condor. With Condor 7.7.0's cgroup integration, the CPU/memory usage is very accurate, but network accounting for vanilla jobs is missing. Network accounting is the topic for this post; we have the following goals:

The accounting should be done for all processes spawned during the batch job.
All network traffic should be included.
Separately account LAN traffic from WAN traffic (in EC2, these have different costs).

The Woes of Linux Network Accounting

The state of Linux network accounting, well, sucks (for our purposes!). Here are a few ways to tackle it, and why each of them won't work:

Counting packets through an interface: If you assume that there is only one job per host, you can count the packets that go through a network interface. This is a big, currently unlikely, assumption.
Per-process accounting: There exists a kernel patch floating around on the internet that adds per-process in/out statistics. However, other than polling frequently, we have no mechanism to account for short-lived processes. Besides, asking folks to run custom kernels is a good way to get ignored.
cgroups: There is a net controller in cgroups. This marks packets in such a way that they can be manipulated by the tc utility. tc controls the layer of buffering before packets are transferred to the network card and can do accounting. Unfortunately:

In RHEL6, there's no way to persist tc rules.
This only accounts for outgoing packets; incoming packets do not pass through.
We cannot distinguish between local network traffic and off-campus network traffic. This could actually be overcome with a technique similar to Berkeley Packet Filters (BPF), but it would be difficult.

ptrace or dynamic loader techniques: There exist libraries (exemplified by parrot) that provide a mechanism for intercepting calls. We could instrument this. However, this path is notoriously buggy and difficult to maintain: it would require a lot of code, and would not work for statically-compiled processes.

The most full-featured network accounting is in the routing code controlled by iptables. Particularly, this can account incoming and outgoing traffic, plus differentiate between on-campus and off-campus traffic.

We're going to tackle the problem using iptables; the trick is going to be distinguishing all the traffic from a single batch job. As in the previous series on managing batch system processes, we are going to borrow heavily from techniques used in Linux containers.

Per-Batch Job Network Statistics

To get perfect per-batch-job network statistics that differentiate between local and remote traffic, we will combine iptables, NAT, virtual ethernet devices, and network namespaces. It will somewhat be a tour-de-force of the Linux kernel networking - and currently very manual. Automation is still forthcoming.

This recipe is a synthesis of the ideas presented in the following pages:

Manually setting up networking for a container: http://lxc.sourceforge.net/index.php/about/kernel-namespaces/network/configuration/
Traffic accounting with iptables: http://www.catonmat.net/blog/traffic-accounting-with-iptables/
Using a NAT between the "container"

We'll be thinking of the batch job as a "semi-container": it will get its own network device like a container, but have more visibility to the OS than in a container. To follow this recipe, you'll need RHEL6 or later.

First, we'll create a pair of ethernet devices and set up NAT-based routing between them and the rest of the OS. We will assume eth0 is the outgoing network device and that the IPs 192.168.0.1 and 192.168.0.2 are currently not routed in the network.

Enable IP forwarding:
```
echo 1 > /proc/sys/net/ipv4/ip_forward
```
Create a veth ethernet device pair:
```
ip link add type veth
```
This will create two devices, veth0 and veth1, that act similar to a Unix pipe: bytes sent to veth1 will be received by veth0 (and vice versa).
Assign IPs to the new veth devices; we will use 192.168.0.1 and 192.168.0.2:
```
ifconfig veth0 192.168.0.1/24 up
ifconfig veth1 192.168.0.2/24 up
```
Download and compile ns_exec.c; this is a handy utility developed by IBM that allows us to create processes in new namespaces. Compilation can be done like this:
```
gcc -o ns_exec ns_exec.c
```
This requires a RHEL6 kernel and the kernel headers.
In a separate window, launch a new shell in a new network and mount namespace:
```
./ns_exec -nm -- /bin/bash
```
We'll refer to this as shell 2 and our original window as shell 1.
Use ps to determine the pid of shell 2. In shell 1, execute:
```
ip link set veth1 netns $PID_OF_SHELL_2
```
In shell 2, you should be able to run ifconfig and see veth1.
In shell 2, re-mount the /sys filesystem and enable the loopback device:
```
mount -t sysfs none /sys
ifconfig lo up
```

At this point, we have a "batch job" (shell 2) with its own dedicated networking device. All traffic generated by this process - or its children - must pass through here. Traffic generated in shell 2 will go into veth1 and out veth0. However, we haven't hooked up the routing for veth0, so packets currently stop there; fairly useless.

Next, we create a NAT between veth0 and eth0. This is a point of convergence - alternately, we could bridge the networks at layer 2 or layer 3 and provide the job with its own public IP. I'll leave that as an exercise for the reader. For the NAT, I will assume that 129.93.0.0/16 is the on-campus network and everything else is off-campus. Everything will be done in shell 1:

Verify that any firewall won't be blocking NAT packets. If you don't know how to do that, turn off the firewall with:
```
iptables -F
```
If you want a firewall, but don't know how iptables works, then you probably want to spend a few hours learning first.

Enable the packet mangling for NAT:

iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

Forward packets from veth0 to eth0, using separate rules for on/off campus:

iptables -A FORWARD -i veth0 -o eth0 --dst 129.93.0.0/16 -j ACCEPT
iptables -A FORWARD -i veth0 -o eth0 ! --dst 129.93.0.0/16 -j ACCEPT

Forward TCP connections from eth0 to veth0 using separate rules:

iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED --src 129.93.0.0/16 -j ACCEPT
iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED ! --src 129.93.0.0/16 -j ACCEPT
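Since automation is still forthcoming, here is a rough Python sketch of how the four FORWARD rules above could be generated for an arbitrary device pair and campus network. The function name and parameters are hypothetical; the sketch only prints the commands, so nothing is changed on the host.

```python
def accounting_rules(inner_dev, outer_dev, campus_net):
    """Return the four FORWARD rules as iptables argument lists."""
    established = ["-m", "state", "--state", "RELATED,ESTABLISHED"]
    rules = []
    # Outgoing traffic: job-side device to physical device, on/off campus.
    for negate in (False, True):
        bang = ["!"] if negate else []
        rules.append(["iptables", "-A", "FORWARD", "-i", inner_dev, "-o", outer_dev]
                     + bang + ["--dst", campus_net, "-j", "ACCEPT"])
    # Incoming traffic: only replies to connections the job opened.
    for negate in (False, True):
        bang = ["!"] if negate else []
        rules.append(["iptables", "-A", "FORWARD", "-i", outer_dev, "-o", inner_dev]
                     + established + bang + ["--src", campus_net, "-j", "ACCEPT"])
    return rules


if __name__ == "__main__":
    for rule in accounting_rules("veth0", "eth0", "129.93.0.0/16"):
        print(" ".join(rule))
```

With the arguments shown, the printed commands are the same four lines as in the recipe; a different device pair per job would let each job keep its own counters.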

At this point, you can switch back to shell 2 and verify the network is working. iptables will automatically do accounting; you just need to enable command line flags to get it printed:
iptables -L -n -v -x
If you look at the network accounting reference, they show how to separate all the accounting rules into a separate chain. This allows you to, for example, reset counters for only the traffic accounting. On my example host, the output looks like this:

Chain INPUT (policy ACCEPT 4 packets, 524 bytes)
    pkts      bytes target     prot opt in     out     source               destination        

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination        
      30     1570 ACCEPT     all  --  veth0  eth0    0.0.0.0/0            129.93.0.0/16      
      18     1025 ACCEPT     all  --  veth0  eth0    0.0.0.0/0           !129.93.0.0/16      
      28    26759 ACCEPT     all  --  eth0   veth0   129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED
      17    10573 ACCEPT     all  --  eth0   veth0  !129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED

Chain OUTPUT (policy ACCEPT 4 packets, 276 bytes)
    pkts      bytes target     prot opt in     out     source               destination

As you can see, my "job" has downloaded about 26KB from on-campus and 10KB from off-campus.
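A batch system would want to harvest these counters programmatically rather than read them by eye. The following is a minimal Python sketch, not part of the original setup: it assumes the text comes from "iptables -L -n -v -x" in the column layout shown above, and the function name forward_bytes is purely illustrative.

```python
def forward_bytes(listing):
    """Map (in, out, source, destination) -> byte count for FORWARD rules.

    Assumes the column layout printed by `iptables -L -n -v -x`.
    """
    totals = {}
    in_forward = False
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Chain":
            in_forward = fields[1] == "FORWARD"
            continue
        # Skip other chains and the column-header line.
        if not in_forward or len(fields) < 9 or not fields[0].isdigit():
            continue
        nbytes = int(fields[1])
        iface_in, iface_out, src, dst = fields[5:9]
        totals[(iface_in, iface_out, src, dst)] = nbytes
    return totals


sample = """\
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination
      30     1570 ACCEPT     all  --  veth0  eth0    0.0.0.0/0            129.93.0.0/16
      28    26759 ACCEPT     all  --  eth0   veth0   129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED
"""
counters = forward_bytes(sample)
# Bytes the job downloaded from on-campus hosts.
print(counters[("eth0", "veth0", "129.93.0.0/16", "0.0.0.0/0")])
```

Keying on the interface pair plus source and destination keeps the on-campus and off-campus rules apart, which is the whole point of writing them as separate rules.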

Voilà! Network accounting appropriate for a batch system!

Creating a VM for OpenStack

2011-08-26T12:50:00.000-07:00

Intro

Here at HCC, we have a few VM-based projects going. One is the Condor-based VM launching that Ashu referenced in his previous posting. That project takes an existing capability (the Condor batch system hooked to the grid) and extends it; instead of launching processes, one can launch an entire VM.

One of our other employees, Josh, has been working from the other direction: taking a common "cloud platform", OpenStack, and seeing if it can be adapted to our high-throughput needs. The OpenStack work is in its beginning phases, but bits and pieces are starting to become functional.

Last night, I tried out the install for the first time. One of the initial tasks I wanted to accomplish was to create a custom VM. A lot of the OpenStack documentation is fairly Ubuntu specific, so I've taken their pages and adapted them for installing from a CentOS 5.6 machine. Unfortunately, I didn't take any nice screen shots like Ashu did, but I hope this will be useful to others.

Long term, we plan to open OpenStack up to select OSG VOs for testing. While we are still in the "tear it down and rebuild once a week" mode, it's just been opened up to select HCC users.

So, without further ado, I present...

Creating a new Fedora image using HCC's OpenStack

These notes are based on the upstream openstack documents here:

http://docs.openstack.org/trunk/openstack-compute/admin/content/creating-a-linux-image.html

Prerequisites

It all starts with an account.

For local users, contact hcc-support to get your access credentials. They will come in a zipfile. Download the zipfile into your home directory and unpack it. Among other things, there will be a novarc file. Source this:

source novarc

This will set up environment variables in your shell pointing to your login credentials. Do not share these with other people! You will need to do this each time you open a new shell.

To create the image, you will need root access on a development machine with KVM installed. I used a CentOS 5.6 machine and did:

yum groupinstall kvm

to get the various necessary KVM packages. I als

First, create a new raw image file:

qemu-img create -f raw /tmp/server.img 5G

This will be the block device that is presented to your virtual machine; make it as large as necessary. Our current hardware is pretty space-limited: smaller is encouraged. Next, download the Fedora boot ISO:

curl http://serverbeach1.fedoraproject.org/pub/alt/bfo/bfo.iso > /tmp/bfo.iso

This is a small, 670KB ISO file that contains just enough information to bootstrap the Anaconda installer. Next, we'll boot it as a virtual machine on your local system.

sudo /usr/libexec/qemu-kvm -m 2048 -cdrom /tmp/bfo.iso -drive file=/tmp/server.img -boot d -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize

This will create a simple virtual machine (2 cores, 2GB RAM) with /tmp/server.img as a drive, and boot the machine from /tmp/bfo.iso. It will also allow you to connect to the VM via a VNC viewer.

If you are physically on the host machine, you can use a VNC viewer for screen ":0". If you are logged in remotely (I log in from my Mac), you'll want to port-forward:

ssh -L 5900:localhost:5900 username@remotemachine.example.com

From your laptop, connect to localhost:0 with a VNC viewer. Note that the most common VNC viewers on the Mac (the built-in Remote Viewer and Chicken of the VNC) don't work with KVM. I found that "JollyFastVNC" works, but costs $5 from the App Store.

Once logged in, select the version of Fedora you'd like to install, and "click next" until the installation is done. Fedora 15 is sure nice :)

Fedora will want to reboot the machine, but the reboot will fail because KVM is set to only boot from the CD. So, once it tries to reboot, kill KVM and start it again with the following arguments:

sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/server.img -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize

Again, connect via VNC, and do any post-install customization. Start by updating and turning on SSH:

yum update
yum install openssh-server
chkconfig sshd on

You will need to tweak /etc/fstab to make it suitable for a cloud instance. Nova-compute may resize the disk at the time of launch of instances based on the instance type chosen. This can make the UUID of the disk invalid. Further, we will remove the LVM setup, and just have the root partition present (no swap, no /boot).

Edit /etc/fstab inside the VM. Change the following three lines:

/dev/mapper/VolGroup-lv_root /                       ext4    defaults        1 1
UUID=0abae194-64c8-4d13-a4c0-6284d9dcd7b4 /boot                   ext4    defaults        1 2
/dev/mapper/VolGroup-lv_swap swap                    swap    defaults        0 0

to just one line:

LABEL=uec-rootfs              /          ext4           defaults     0    0

Since Fedora does not ship with an init script for OpenStack, we will do a nasty hack for pulling the correct SSH key at boot. Edit the /etc/rc.local file and add the following lines before the line "touch /var/lock/subsys/local":

depmod -a
modprobe acpiphp

# simple attempt to get the user ssh key using the meta-data service
mkdir -p /root/.ssh
echo >> /root/.ssh/authorized_keys
curl -m 10 -s http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key | grep 'ssh-rsa' >> /root/.ssh/authorized_keys
echo "AUTHORIZED_KEYS:"
echo "************************"
cat /root/.ssh/authorized_keys
echo "************************"

Once you are finished customizing, go ahead and power off:

poweroff

Converting to an acceptable OpenStack format

The image that needs to be uploaded to OpenStack needs to be an ext4 filesystem image; we currently have a raw block device image. We will extract this filesystem from running a few commands on the host machine. First, we need to find out the starting sector of the partition. Run:

fdisk -ul /tmp/server.img

You should see an output like this (the error messages are harmless):

last_lba(): I don't know how to handle files with mode 81a4
You must set cylinders.
You can do this from the extra functions menu.

Disk /dev/loop0: 5368 MB, 5368709120 bytes
255 heads, 63 sectors/track, 652 cylinders, total 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes

      Device Boot      Start         End      Blocks   Id  System
/dev/loop0p1   *        2048     1026047      512000   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/loop0p2         1026048    10485759     4729856   8e  Linux LVM
Partition 2 does not end on cylinder boundary.

Note the following commands assume the units are 512 bytes. You will need the start and end number for the "Linux LVM"; in this case, it is 1026048 and 10485759.
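Getting the sector arithmetic right matters, so here is a small Python sketch of it. It is an illustration rather than part of the original procedure: it relies on fdisk's start and end sectors both being inclusive, so the partition length is end - start + 1, and the helper name dd_extract_args is made up for this example.

```python
SECTOR_SIZE = 512  # bytes, per the "Units" line in the fdisk output


def dd_extract_args(start, end):
    """Return (skip, count) in sectors for a partition spanning start..end inclusive."""
    if end < start:
        raise ValueError("end sector precedes start sector")
    return start, end - start + 1


skip, count = dd_extract_args(1026048, 10485759)
print(skip, count, count * SECTOR_SIZE)
```

As a cross-check, the byte size computed this way should equal the "Blocks" column of the fdisk output (which counts 1024-byte blocks) multiplied by 1024.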

Copy the entire partition to a new file

dd if=/tmp/server.img of=/tmp/server.lvm.img skip=1026048 count=$((10485759-1026048+1)) bs=512

For "skip", use the start sector from the fdisk output; for "count", use the number of sectors in the partition (end - start + 1, since both the start and end sectors are part of the partition). Now we have our LVM image; we'll need to activate it. First, attach the LVM image to the loopback device and look for the volume group name:

[bbockelm@localhost ~]$ sudo /sbin/losetup /dev/loop0 /tmp/server.lvm.img
[bbockelm@localhost ~]$ sudo /sbin/pvscan
  PV /dev/sdb1    VG vg_home     lvm2 [7.20 TB / 0    free]
  PV /dev/sda2    VG vg_system   lvm2 [73.88 GB / 0    free]
  PV /dev/loop0   VG VolGroup    lvm2 [4.50 GB / 0    free]
  Total: 3 [1.28 TB] / in use: 3 [1.28 TB] / in no VG: 0 [0   ]

Note the third listing is for our loopback device (/dev/loop0) and a volume group named, simply, "VolGroup". We'll want to activate that:

[bbockelm@localhost ~]$ sudo /sbin/vgchange -ay VolGroup
  2 logical volume(s) in volume group "VolGroup" now active

We can now see the Fedora root file system in /dev/VolGroup/lv_root. We use dd to make a copy of this disk:

sudo dd if=/dev/VolGroup/lv_root of=/tmp/serverfinal.img

I get the following output:

[bbockelm@localhost ~]$ sudo dd if=/dev/VolGroup/lv_root of=/tmp/serverfinal2.img
3145728+0 records in
3145728+0 records out
1610612736 bytes (1.6 GB) copied, 14.5444 seconds, 111 MB/s

It's time to release all our devices. Start by deactivating the volume group:

[bbockelm@localhost ~]$ sudo /sbin/vgchange -an VolGroup
  0 logical volume(s) in volume group "VolGroup" now active

Then, detach our loopback device:

[bbockelm@localhost ~]$ sudo /sbin/losetup -d /dev/loop0

We will do one last tweak: change the label on our filesystem image to "uec-rootfs":

sudo /sbin/tune2fs -L uec-rootfs /tmp/serverfinal.img

*Note* that your filesystem image is ext4; if your host is RHEL5.x (this is my case!), your version of tune2fs will not be able to complete this operation. In this case, you will need to restart your VM in KVM with the newly-extracted serverfinal.img as a second hard drive. I did the following KVM invocation:

sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/server.img -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize -drive file=/tmp/serverfinal.img

The second drive shows up as /dev/sdb; go ahead and re-execute tune2fs from within the VM:

[root@localhost ~]# tune2fs -L uec-rootfs /dev/sdb

Extract Kernel and Initrd for OpenStack

Fedora creates a small boot partition separate from the LVM we extracted previously. We'll need to mount it, and copy out the kernel and initrd. First, mount the loopback device and map the partitions.

[bbockelm@localhost ~]$ sudo /sbin/losetup -f /tmp/server.img
[bbockelm@localhost ~]$ sudo /sbin/kpartx -a /dev/loop0

The boot partition should now be available at /dev/mapper/loop0p1. Mount this:

[bbockelm@localhost ~]$ sudo mkdir  /tmp/server_image/
[bbockelm@localhost ~]$ sudo mount /dev/mapper/loop0p1  /tmp/server_image/

Now, copy out the kernel and initrd:

[bbockelm@localhost ~]$ cp /tmp/server_image/vmlinuz-2.6.40.3-0.fc15.x86_64 ~
[bbockelm@localhost ~]$ cp /tmp/server_image/initramfs-2.6.40.3-0.fc15.x86_64.img ~

Unmount and unmap:

[bbockelm@localhost ~]$ sudo umount /tmp/server_image
[bbockelm@localhost ~]$ sudo /sbin/kpartx -d /dev/loop0
[bbockelm@localhost ~]$ sudo /sbin/losetup -d /dev/loop0

Upload into OpenStack

We need to bundle, then upload the kernel, initrd, and finally the image. First, the kernel:

[bbockelm@localhost ~]$ euca-bundle-image -i ~/vmlinuz-2.6.40.3-0.fc15.x86_64 --kernel true
Checking image
Encrypting image
Splitting image...
Part: vmlinuz-2.6.40.3-0.fc15.x86_64.part.00
Generating manifest /tmp/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
[bbockelm@localhost ~]$ euca-upload-bundle -b testbucket -m /tmp/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
Checking bucket: testbucket
Uploading manifest file
Uploading part: vmlinuz-2.6.40.3-0.fc15.x86_64.part.00
Uploaded image as testbucket/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
[bbockelm@localhost ~]$ euca-register testbucket/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
IMAGE	aki-0000000a

Write down the kernel ID; it is aki-0000000a above. Then, the initrd:

euca-bundle-image -i ~/initramfs-2.6.40.3-0.fc15.x86_64.img --ramdisk true
euca-upload-bundle -b testbucket -m /tmp/initramfs-2.6.40.3-0.fc15.x86_64.img.manifest.xml
euca-register testbucket/initramfs-2.6.40.3-0.fc15.x86_64.img.manifest.xml

My initrd's ID was ari-0000000b. Finally, the disk image itself:

euca-bundle-image --kernel aki-0000000a --ramdisk ari-0000000b -i /tmp/serverfinal.img -r x86_64

This will save the manifest into /tmp under the name "serverfinal.img.manifest.xml". I didn't particularly care for the name, so I changed it to "fedora-15.img.manifest.xml". Now, upload:

euca-upload-bundle -b testbucket -m /tmp/fedora-15.img.manifest.xml
euca-register testbucket/fedora-15.img.manifest.xml

Congratulations! You now have a brand-new Fedora-15 image ready to use. Fire up HybridFox and see if you were successful.

KVM and Condor (Part 1): Creating the virtual machine.

2011-08-18T16:54:00.000-07:00

My next topic, which will be a two-part blog, is launching a Virtual Machine (VM) in a Condor environment. In the first of these two posts I will share the steps that I took to create a VM that I will launch as a job in Condor.

I will be using the Kernel-based Virtual Machine (KVM) implementation for Linux guests. KVM is a full virtualization framework which can run multiple unmodified guests including various flavors of Microsoft Windows, Linux operating systems and other UNIX family systems. To see the types of guest operating systems and platforms that KVM supports you can look at http://www.linux-kvm.org/page/Guest_Support_Status

Let’s get started. For this blog the host system on which I am working is running CentOS 6.0 with Linux 2.6.32 on an x86_64 platform. I will be creating a CentOS 5.6 image for the VM guest. As the first step, I will get my host system ready with KVM tools and other dependencies. To do this I require a package called kvm – this package includes the VM kernel module. In addition to the kvm package I will be using three tools (viz. virt-install, virsh, and virt-viewer) from a toolkit called libvirt. Libvirt (http://libvirt.org/) is a hypervisor-independent API that is able to interact with the virtualization capabilities of various operating systems. The commands below show you how to use yum to install kvm and the libvirt related packages:

yum install kvm

yum install virt-manager libvirt libvirt-python python-virtinst libvirt-client

I am now ready to create the VM by using the following command:

virt-install \
  --name=vm56-25GB \
  --disk path=/home/aguru/myvms/vm5.6-25GB.img,sparse=true,size=25 \
  --ram=2048 \
  --location=http://mirror.unl.edu/centos/5.6/os/x86_64/ \
  --os-type=linux \
  --vnc

In the above code snippet 'virt-install' is a libvirt command line tool for provisioning new virtual machines. The different options that I have used above are explained below:
--name is the name of the new machine that I am creating
--disk specifies the absolute path of the virtual machine image (file) that will be created. The ‘sparse’ option in the same line means that the host system does not have to allocate all the space up-front, and ‘size’ gives the size of the hard disk drive of the VM in GB
--ram is the RAM of the guest in MB
--location provides the location of the OS install files for a network install of the guest
--os-type specifies the type of guest operating system
--vnc sets up a virtual console in the guest and exports it as a VNC server on the host

Unless there are any missing dependencies and tools that somehow did not get installed correctly - your install should start with a new VNC window popping up on your display. I have a few screen captures of what you may see shown below.

** Just a quick note - to release the mouse cursor from the VNC window you can use Ctrl-Alt.

and so on with finally a screen as below

On the final screen of installation you can click the 'Reboot' button from the VM window to restart the guest VM.

A few basic commands to list, start and stop a VM

virsh list --all

The output of virsh list --all shows the defined VMs and their current state; e.g., a typical output may look like:

Id Name                 State
----------------------------------
- vm56-15KSGB          shut off
- vm56-25GB            shut off

In order to start a VM from the shut off state issue a virsh start command. Note below that virsh list --all now shows an Id and the running state of the VM (vm56-15KSGB):

virsh start vm56-15KSGB

virsh list --all
Id Name                 State
----------------------------------
1 vm56-15KSGB          running
- vm56-25GB            shut off

To launch a VNC console for displaying the console of a running VM you can use virt-viewer e.g.

virt-viewer  1

And finally, to shut down a running VM use virsh shutdown, or force it off with virsh destroy, e.g.

virsh shutdown 1

virsh destroy 1

Both virt-viewer and virsh shutdown take the Id of the running VM as an argument.

What if I have a Kickstart file for the VM I want to create?

In case you have a Kickstart file that you would like to use for creating the VM you may use the following command:

virt-install \
  --name=vm56-15KSGB \
  --disk path=/home/aguru/myvms/vm56-15KSGB.img,sparse=true,size=15 \
  --ram=2048 \
  --location=http://newman.ultralight.org/os/centos/5.5/x86_64 \
  --os-type=linux \
  --vnc \
  -x "ks=http://httpdserver.hosting.kickstart/pathto.kickstart.file"

The only addition in this virt-install command, compared to its previous use in this blog, is the extra flag '-x'. The value passed along with the -x flag points to the web location of the kickstart file.

That is all for this post. In the next post I will talk about using this created image and then launching it in a Condor VM Universe.

Squid Caching in OSG Environment

2011-07-12T19:23:00.000-07:00

A few months back I assisted a research group from University of Nebraska Medical Center (UNMC) in deploying a search for mass spectrometry-based proteomics analysis. This search was performed using a program called The Open Mass Spectrometry Search Algorithm (OMSSA) using the Open Science Grid (OSG) via GlideinWMS Frontend. In this blog I will talk about the motivation and use of HTTP file transfer along with squid caching for input data and executable files for the jobs deployed over the OSG. I will also show a basic example explaining the use of Squid in the OSG environment.

While working with the UNMC research group and after looking at the OMSSA specifications and documentation we identified the following characteristics regarding the computation and the data handling requirements for the proteomics analysis:
•   A total of 45 datasets with each dataset of about 21MB.
•   22,000 comparisons/searches (short jobs) per dataset
•   The executables along with search libraries for the comparison sum up to a total of 83MB as a compressed archive.
Based on the above requirements and a few additional tests it was determined that the job is well adapted for OSG via GlideinWMS. It was also decided that each GlideinWMS job would contain about 172 comparisons, which works out to a total of about 5756 individual jobs (22000*45/172).
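The job count is simple arithmetic, spelled out here as a short Python sketch using the figures quoted above. Rounding up is my assumption, on the grounds that the leftover comparisons still need a job of their own.

```python
import math

datasets = 45
comparisons_per_dataset = 22000
comparisons_per_job = 172

total_comparisons = datasets * comparisons_per_dataset
# Round up: a partially filled final job still has to run.
jobs = math.ceil(total_comparisons / comparisons_per_job)
print(total_comparisons, jobs)
```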

Data in the Open Science Grid has always been more difficult to handle than computation. The challenges get more difficult when either the number of jobs or the data size increases. There are various methods that are used to overcome and simplify these challenges. Table 1 below shows a rule of thumb that I generally follow to help identify the best mode of data transfer for jobs in the OSG environment. Each data transfer method in Table 1 has its own advantages: Condor’s internal file transfer is a built-in method, so no extra scripting is required; SRM can handle large data stores and large data transfers; pre-staging can distribute the load of pulling down data.

Table 1. Rule of thumb for data transfer using condor/GlideinWMS jobs in OSG
Data Size	Data Transfer Method
< 10MB	Condor's File Transfer Mechanism
10MB - 500MB	Storage Element(SE)/Storage Resource Manager(SRM) interface
> 500MB	SRM/dCache or Pre-staging

When the number of jobs is large and the data transfer size reaches the upper limits of Condor's internal file transfer, we have found in our past experience that HTTP file transfer has been fairly successful. By doing so we are able to move the load of transferring input files and executables away from the GlideinWMS Frontend server. For the proteomics analysis project, since the compressed archive of the search library and the executables (83MB) was the same across all jobs, and the input data was the same within each dataset, we decided to extend our HTTP file transfer experience by adding squid caching. The advantage of caching becomes more evident as more jobs are allocated compute nodes at a given site having a local (site specific) squid server, until we reach the limit of the squid server itself.

Every CMS and ATLAS site is required to have a squid whose location is available via the environment variable OSG_SQUID_LOCATION. This implies that by using a very simple wrapper script on a compute node one can easily pull down input files and/or executables using a client tool such as wget or curl and then proceed with the actual computation. The example below shows a bash script that reads the OSG_SQUID_LOCATION environment variable on a compute node and then tries to download the file via squid; on failure, the script downloads the file directly from the source. (Ref: https://twiki.grid.iu.edu/bin/view/Documentation/OsgHttpBasics)

#!/bin/bash
website=http://google.com/

#Section A
source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi

#Section B
wget --retry-connrefused --waitretry=20 $website

#Section C 
#Check if the download worked
if [ $? -ne 0 ]
then
   unset http_proxy
   wget --retry-connrefused --waitretry=20 $website
   if [ $? -ne 0 ]
   then
      exit 1
   fi
fi

Listed below is the explanation of the above code:

Section A: Check if environment variable OSG_SQUID_LOCATION is set, if so then export its value as the environment variable http_proxy which is used by wget for squid server location
Section B: Download the file using wget, the flag --retry-connrefused considers a connection refused as a transient error and tries again. This option helps to handle short term failures. The wait time of 20 seconds in between retries is specified via --waitretry
Section C: If download from the squid server fails then access the actual http source after unsetting the value of http_proxy
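For jobs whose wrapper is written in Python rather than bash, the same try-squid-then-origin logic might be sketched as follows. This is an illustration under stated assumptions, not a drop-in replacement: the function names http_fetch and fetch_with_fallback are hypothetical, only the standard library is used, and the retry and wait behaviour of the wget flags is not reproduced.

```python
import os
import urllib.request


def http_fetch(url, proxy=None, timeout=30):
    """Download url, through proxy if given, otherwise directly."""
    # An empty dict tells ProxyHandler to use no proxy at all.
    handler = urllib.request.ProxyHandler({"http": proxy} if proxy else {})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def fetch_with_fallback(url, fetch=http_fetch, environ=None):
    """Try the site squid first (if advertised), then the origin server."""
    environ = os.environ if environ is None else environ
    squid = environ.get("OSG_SQUID_LOCATION", "UNAVAILABLE")
    # Like ${VAR:-UNAVAILABLE} in the script, treat empty as unset.
    attempts = [] if squid in ("", "UNAVAILABLE") else [squid]
    attempts.append(None)  # final attempt: no proxy
    last_error = None
    for proxy in attempts:
        try:
            return fetch(url, proxy)
        except OSError as err:  # urllib.error.URLError is an OSError subclass
            last_error = err
    raise last_error
```

Passing the fetch function in as a parameter keeps the fallback logic testable without touching the network; a job would simply call fetch_with_fallback(url) and let it read OSG_SQUID_LOCATION from the environment.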

In addition to the availability of a OSG site specific squid server, for this type of data transfer to work one will require a reliable http server which can handle download requests from sites where the squid server is unavailable. Also, the http server must be able to handle requests which are originating from the squid servers along with any failover requests. At UNL we have setup a dedicated HTTP serving infrastructure that has a load balanced failover. This is implemented using the Linux Virtual server and its implementation details are shown in the diagram below.

You can see more detailed examples of squid usage at https://twiki.grid.iu.edu/bin/view/Documentation/OsgHttpBasics
There is also an excellent presentation by Derek Weitzel available at http://docs.google.com/viewer?url=https%3A%2F%2Ftwiki.grid.iu.edu%2Ftwiki%2Fpub%2FCampusGrids%2FApr27%252c2011%2FCampusGridSquid.pdf&embedded=true

Part III: Bulletproof process tracking with cgroups

2011-07-08T16:09:00.000-07:00

Finally, it's time to provide a good solution for accomplishing process tracking in a Linux batch system.
If you recall in Part I, we surveyed common methods for process tracking and ultimately concluded that batch systems used userspace mechanisms (most of which were originally designed for shell-based process control, by the way) that were unreliable, or couldn't detect when failures occur. In Part II, the picture brightened: the kernel provided an event feed about process births and deaths, and informed us when messages were dropped.

In this post, we'll talk about a new feature called "cgroups", short for "control groups". Cgroups are a mechanism in the Linux kernel for managing a set of processes and all their descendants. They are managed through a filesystem-like interface (in the manner of /proc); the directory structure expresses the fact they are hierarchical, and filesystem permissions can be used to restrict the set of users allowed to manipulate them. By default, only root is allowed to manipulate control groups: unlike the process groups, process trees, and environment cookies examined before, a process typically has no ability to change its group. Further, unlike the proc connector API, the control group is assigned synchronously by the kernel at process creation time. Hence, fork-bombs are not an effective way to escape from the group.

While having the tracking done by the kernel is an immense improvement, the true power of cgroups becomes apparent through the use of multiple subsystems. Different cgroup subsystems may act to control scheduler policy, allocate or limit resources, or account for usage.

For example, the memory controller can be used to limit the amount of memory used by a set of processes. This is a huge improvement over the previous memory limit technique (rlimit), where the limit was assigned per-process. With rlimit, you could limit a single process to 1GB, but the job would just spawn N processes of 1GB each, sidestepping your limits. In the kernel shipped with Fedora 15, 10 controllers are active by default. For more information, you can check the documentation:

If you are a Red Hat customer, I find the RHEL6 manual has the best cgroups documentation out there.

To see cgroups in action, use the systemd-cgls command found on Fedora 15. This will print out the current hierarchy of all cgroups. Here's what I see on my system (output truncated for display reasons):

├ condor
│ ├ 17948 /usr/sbin/condor_master -f
│ ├ 17949 condor_collector -f
│ ├ 17950 condor_negotiator -f
│ ├ 17951 condor_schedd -f
│ ├ 17952 condor_startd -f
│ ├ 17953 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 48...
│ └ 18224 condor_procd -A /var/run/condor/procd_pipe.STARTD -R 10000000 -S 60 -C 48...
├ user
│ ├ root
│ │ └ master
│ │   └ 6879 bash
│ └ bbockelm
│   ├ 1168
│   │ ├ 21426 sshd: bbockelm [priv]
│   │ ├ 21429 sshd: bbockelm@pts/3
│   │ ├ 21430 -bash
│   │ └ 21530 systemd-cgls
│   ├ 309
│   │ ├  1110 /usr/libexec/gvfsd-http --spawner :1.4 /org/gtk/gvfs/exec_spaw/0
│   │ ├  6198 gnome-terminal
│   │ ├  6202 gnome-pty-helper

(output trimmed)

└ system
  ├ 1 /bin/systemd --log-level info --log-target syslog-or-kmsg --system --dump...
  ├ sendmail.service
  │ ├ 8603 sendmail: accepting connections
  │ └ 8612 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
  ├ auditd.service
  │ ├ 8542 auditd
  │ ├ 8544 /sbin/audispd
  │ └ 8552 /usr/sbin/sedispatch
  ├ sshd.service
  │ └ 7572 /usr/sbin/sshd

(output trimmed)

All of the processes in my system are in the / cgroup; all login shells are placed inside a cgroup named /user/$USERNAME; each system service (such as ssh) is located inside a cgroup named /system/$SERVICENAME; finally, there's a special one named /condor. More on /condor later.

To see the cgroups for the current process, you can do the following:

[bbockelm@mydesktop ~]$ cat /proc/self/cgroup 
10:blkio:/
9:net_cls:/
8:freezer:/
7:devices:/
6:memory:/
5:cpuacct:/
4:cpu:/
3:ns:/
2:cpuset:/
1:name=systemd:/user/bbockelm/1168

Note that each process is not necessarily in just one cgroup. The rules are that a process can have one cgroup per mount, there are one or more controllers per mount, and a controller can only be mounted once.
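To make the "one cgroup per mount" rule concrete, here is a minimal Python sketch that turns the /proc/self/cgroup text into a mapping from controller to cgroup path. It assumes the three-field "ID:controllers:path" layout shown in the listing above; the function name parse_proc_cgroup is illustrative only.

```python
def parse_proc_cgroup(text):
    """Map each controller (or named hierarchy) to the process's cgroup path."""
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Format: hierarchy-ID:controller-list:cgroup-path
        _hierarchy_id, controllers, path = line.split(":", 2)
        # Controllers mounted together share one line, comma-separated.
        for controller in controllers.split(","):
            membership[controller] = path
    return membership


sample = "6:memory:/\n5:cpuacct:/\n1:name=systemd:/user/bbockelm/1168\n"
print(parse_proc_cgroup(sample))
```

On a live system one would pass in the contents of /proc/self/cgroup (or /proc/PID/cgroup for another process) instead of the sample string.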

Each controller has statistics accessible through the mounted cgroup filesystem. For example, on Fedora 15, if I want to see how much memory all of the processes in the /condor cgroup are using, I can do the following:

[bbockelm@rcf-bockelman ~]$ cat /cgroups/memory/condor/memory.usage_in_bytes 
34365440

But what about the batch system?
I hope our readers can see the immediate utility in having a simple mechanism for unescapable process tracking. We examined one such mechanism before (adding a secondary GID per batch job), but it has a small drawback in that the secondary GID can be used to create permanent objects (files owned by the secondary GID) which outlive the lifetime of the batch job.

But, even in Part I of the series, we concluded that a perfect process tracking mechanism is not enough: we also need to be able to kill processes when the batch job is finished! The cgroups developers must have come to the same conclusion, as one controller is called the freezer. The freezer controller simply stops any process in the cgroup from receiving CPU time from the kernel. All processes in the cgroup are frozen - and there is no way for a process to know it is about to freeze, as they aren't informed via signals. Hence, a process tracker can freeze the processes, send them all SIGKILL, and unfreeze them. All processes will end immediately; none will have the ability to hide in the /proc system or spawn new children in a race condition.
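The freeze, kill, thaw sequence can be sketched in a few lines of Python. This is a rough illustration, not Condor's implementation: it assumes the cgroup v1 freezer file layout (a freezer.state file accepting FROZEN and THAWED, and a cgroup.procs file listing member PIDs), and the function name kill_frozen_cgroup and the injectable kill parameter are hypothetical.

```python
import os
import signal
from pathlib import Path


def kill_frozen_cgroup(cgroup_dir, kill=os.kill):
    """Freeze a cgroup, SIGKILL every process in it, then thaw it."""
    cgroup = Path(cgroup_dir)
    state = cgroup / "freezer.state"
    state.write_text("FROZEN\n")
    # Freezing is asynchronous; a real tracker would re-read freezer.state
    # until it reports FROZEN before trusting the process list.
    pids = [int(pid) for pid in (cgroup / "cgroup.procs").read_text().split()]
    for pid in pids:
        # The signal is sent while the processes cannot run (or fork).
        kill(pid, signal.SIGKILL)
    state.write_text("THAWED\n")
    return pids
```

Because nothing in the group can run between the freeze and the thaw, the list of PIDs read in the middle cannot go stale, which is exactly the race that defeats user-space killers.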

If you look at the first process tree posted, there is a cgroup called "condor". As I presented at Condor Week 2011, condor is now integrated with cgroups. It can be started in a cgroup the sysadmin specifies (such as /condor), and it will create a unique cgroup for each job (/cgroup/job_$CLUSTERID_$PROC_ID). It uses whatever controllers are active on the system to try and track memory consumption, CPU time, and block I/O. When the job ends or is killed, the freezer controller is used to clean up any processes.

Conclusions
As the disparate scientific clusters have become increasingly linked through the use of grids, improved process tracking has become more important. Many sites have users from across the nation; it's no longer possible for a sysadmin to be good friends with each user. Some have jobs of questionable quality; some have virus-ridden laptops.

In the end, traditional process tracking in batch systems is not really ready for modern users. Most modern batch systems no longer rely solely on the original Unix grouping mechanisms, but will still fall to malicious users. The problem is not solvable from user space alone.

Luckily, with the proc connector API (for any Linux 2.6 kernel) and cgroups (for recent kernels), we can greatly improve the state of the art. The set of folks contributing to the Linux kernel is broad, but I understand many of the contributions for cgroups have come from the OpenVZ folks: thanks guys!

As I've been exploring this subject, I have been implementing cgroup usage in Condor: I think it's a great new feature. It will be released with Condor 7.7.0, due in a few days. There's no reason other batch systems can't also adopt cgroups for process tracking: I hope they spread widely in the future!

Part II: Keeping a mindful eye on your users with ProcPolice.

2011-06-24T08:32:00.000-07:00

In Part I of this series, we talked about the various mechanisms a batch system uses to track your job's processes, and concluded the state of the art isn't particularly impressive. The only way to go is up; this post discusses an improved technique for process tracking in Linux. It was motivated by this blog post from the author of upstart. If you feel inspired here, and would like to read some code, it is highly recommended reading.

The previous post went from bad to dire: most batch systems use methods that are easily defeated by the job changing its runtime environment (altering the process group, reparenting to init, or changing the environment). Even when using a reliable tracking method, killing an arbitrary set of processes is not possible.

To top it off - when process tracking or killing goes awry, we have no reliable means to detect when our methods fail.

There's a small, relatively unknown corner of the Linux kernel that can help us out: the proc connector. A privileged process can connect a socket to the kernel and receive a stream of messages about processes on the system. Any time one of the following system calls happens:

fork/clone
exec
exit
setuid
setgid
setsid

for a thread or a process (all the events are documented in linux/cn_proc.h in the kernel's sources), the socket receives a message containing all the relevant event details. By tracking only the fork and exit events, one can build a process tree in memory, starting with the batch system worker process.
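Here's a rough sketch of that bookkeeping in Python. It assumes the events have already been decoded into simple tuples; the class name ProcessTree, the helper apply_event, and the tuple layout are all made up for illustration and are not taken from any real tool.

```python
class ProcessTree:
    """Track the descendants of one root PID from a stream of fork/exit events."""

    def __init__(self, root_pid):
        self.root_pid = root_pid
        # Maps each tracked PID to its parent PID (the root maps to None).
        self.parent = {root_pid: None}

    def on_fork(self, parent_pid, child_pid):
        # Only follow forks whose parent is already part of the tree.
        if parent_pid in self.parent:
            self.parent[child_pid] = parent_pid

    def on_exit(self, pid):
        # Forget the exited process; its children remain tracked.
        self.parent.pop(pid, None)

    def tracked(self):
        return set(self.parent)


def apply_event(tree, event):
    """Dispatch one (hypothetical) decoded event tuple to the tree."""
    kind = event[0]
    if kind == "fork":
        tree.on_fork(event[1], event[2])
    elif kind == "exit":
        tree.on_exit(event[1])


tree = ProcessTree(root_pid=100)      # 100 plays the batch system worker
for event in [
    ("fork", 100, 101),               # the job starts
    ("fork", 101, 102),               # the job spawns a child
    ("fork", 999, 1000),              # unrelated process, ignored
    ("exit", 101),                    # the job exits, its child lives on
]:
    apply_event(tree, event)
```

Note that PID 102 stays in the tree after its parent exits, which is exactly the case the older tracking methods lose.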

Because it is based on events from the kernel, not periodic polling of /proc from user space, this is a far more reliable method for tracking processes. With a little help from the kernel, the picture is already brighter!
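For the curious, subscribing to the event stream might look something like the sketch below. The numeric constants are copied by hand from linux/netlink.h, linux/connector.h and linux/cn_proc.h; the function names build_subscribe_message and open_proc_connector are my own invention. The second function only works on Linux as root, so treat it as untested illustration.

```python
import os
import socket
import struct

# Values copied from the kernel headers (an assumption worth double-checking
# against the headers on your own system).
NETLINK_CONNECTOR = 11
NLMSG_DONE = 3
CN_IDX_PROC = 1
CN_VAL_PROC = 1
PROC_CN_MCAST_LISTEN = 1


def build_subscribe_message(pid):
    """Pack the netlink message asking the kernel to start sending events."""
    op = struct.pack("=I", PROC_CN_MCAST_LISTEN)
    # struct cn_msg: idx, val, seq, ack, len, flags, followed by the payload.
    cn_msg = struct.pack("=IIIIHH", CN_IDX_PROC, CN_VAL_PROC, 0, 0, len(op), 0) + op
    # struct nlmsghdr: total length, type, flags, sequence number, sender port id.
    header = struct.pack("=IHHII", 16 + len(cn_msg), NLMSG_DONE, 0, 0, pid)
    return header + cn_msg


def open_proc_connector():
    """Open and subscribe a proc connector socket (Linux only, needs root)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM, NETLINK_CONNECTOR)
    sock.bind((os.getpid(), CN_IDX_PROC))
    sock.send(build_subscribe_message(os.getpid()))
    return sock


message = build_subscribe_message(1234)
```

After subscribing, each recv() on the socket yields one packed event, which would still need to be unpacked according to the structures in linux/cn_proc.h.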

The drawback here is that, while a message is being processed in user space, further messages to the socket are buffered in memory. When the buffer is full, the kernel drops any further messages: the tracker will lose possibly important events. The event stream is asynchronous: the fork or exit occurs regardless of whether you process the associated message. Unless the tracking code is particularly slow, it is likely the only case where the buffer overruns is the case we care about: someone launching a fork bomb to escape the system.

If you have too many messages, the first step is to receive fewer messages. One can hand the kernel a small program in a special assembly language that pre-filters messages: a message that isn't put into the queue can't overflow it! Writing these filters is a fun academic exercise, but not useful: when a fork bomb occurs, the messages the tracker needs to receive are precisely the ones overflowing the buffer!

So while never failing is preferable, detecting when we have failed is acceptable: when the buffer is full, the next attempt to read a message from the socket will fail with ENOBUFS to indicate the buffer has overflowed. Actions can be taken: for most batch systems, a nasty email to the user and sysadmin might be sufficient. If you work for the NSA, perhaps the appropriate response is to power down the worker node and send out black helicopters.
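A minimal sketch of that failure detection, assuming Python 3 (where socket errors are OSError). The read_event helper and the FakeSocket stand-in are hypothetical names; the fake lets the example run without root or a real netlink socket.

```python
import errno


def read_event(sock, bufsize=4096):
    """Read one message; return (data, overflowed)."""
    try:
        return sock.recv(bufsize), False
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            # The kernel dropped messages: our process tree may now be stale.
            return None, True
        raise


class FakeSocket:
    """Stand-in for the real socket so the sketch runs anywhere."""

    def __init__(self, results):
        self.results = list(results)

    def recv(self, bufsize):
        result = self.results.pop(0)
        if isinstance(result, Exception):
            raise result
        return result


sock = FakeSocket([b"event", OSError(errno.ENOBUFS, "No buffer space available")])
first = read_event(sock)    # a normal message
second = read_event(sock)   # the overflow is reported, not raised
```

What to do once overflowed comes back true (email, log, or helicopters) is a site policy decision, so the sketch just reports it to the caller.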

I've taken the approach outlined here and turned it into a small package called "ProcPolice". It consists of a simple daemon which listens to the stream of events and adds a process to an in-memory tree if the process can be traced back to a batch system job. ProcPolice will detect when a process reparents to init and, if it was launched from the batch system and is non-root, it can log the event or kill the newly-daemonized process. In testing, it is able to stop simple fork bombs and detect more sophisticated ones.
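To give a flavor of the reparenting check, here is a guess at how the logic could look; it is not lifted from the ProcPolice source. It relies on the fact that when a process exits, its surviving children are handed to init. The function names and the "ignore"/"log"/"kill" labels are hypothetical.

```python
def orphaned_by_exit(parent_of, exited_pid):
    """Return tracked PIDs whose parent just exited; the kernel hands these to init."""
    return sorted(pid for pid, parent in parent_of.items() if parent == exited_pid)


def choose_action(uid, kill_enabled):
    """Pick a response for a daemonized job process; root-owned ones are left alone."""
    if uid == 0:
        return "ignore"
    return "kill" if kill_enabled else "log"


# PID -> parent PID for processes traced back to a batch job.
parent_of = {101: 100, 102: 101, 103: 101}
orphans = orphaned_by_exit(parent_of, exited_pid=101)
actions = [choose_action(uid=1000, kill_enabled=True) for _ in orphans]
```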

As ProcPolice runs as a separate daemon that watches the batch system and intervenes only on daemonized processes, it can be used immediately with any batch system (Condor and PBS have been tested). ProcPolice is available in source code form from

svn://t2.unl.edu/brian/proc_police

Or as a RHEL5-compatible RPM.

ProcPolice was invented with a few specific requirements in mind:

Prevent batch system-based processes from outliving the lifetime of the batch system job without changing the runtime of the job itself.
Do this without support in the batch system itself.
Detect when failures occur.
Support RHEL5 (the OS used by the LHC for the next few years).

It turns out the last requirement is perhaps the most stringent one; newer kernels have a specific feature for tracking and controlling arbitrary sets of processes. This is the topic of the next part of this series.