Tuesday, July 12, 2011

Squid Caching in OSG Environment

A few months back I assisted a research group from the University of Nebraska Medical Center (UNMC) in deploying a search workflow for mass spectrometry-based proteomics analysis. The search was performed with the Open Mass Spectrometry Search Algorithm (OMSSA) on the Open Science Grid (OSG) via a GlideinWMS Frontend. In this post I will talk about the motivation for using HTTP file transfer, combined with squid caching, to deliver the input data and executable files for the jobs deployed over the OSG. I will also show a basic example of using squid in the OSG environment.

While working with the UNMC research group, and after reviewing the OMSSA specifications and documentation, we identified the following characteristics of the computation and data-handling requirements for the proteomics analysis:
•    A total of 45 datasets, each about 21 MB in size.
•    22,000 comparisons/searches (short jobs) per dataset.
•    The executables and search libraries for the comparisons add up to an 83 MB compressed archive.
Based on these requirements and a few additional tests, we determined that the workload is well suited to the OSG via GlideinWMS. We also decided that each GlideinWMS job would bundle about 172 comparisons, which works out to roughly 5,756 individual jobs (22,000 * 45 / 172).

Data in the Open Science Grid has always been more difficult to handle than computation, and the challenge grows as the number of jobs or the size of the data increases. Various methods are used to overcome and simplify these challenges. Table 1 below shows the rule of thumb I generally follow to identify the best mode of data transfer for jobs in the OSG environment. Each data transfer method in Table 1 has its own advantages: Condor's internal file transfer is built in, so no extra scripting is required; SRM can front large data stores and handle large individual transfers; and pre-staging can distribute the load of pulling down data.

Table 1. Rule of thumb for data transfer using Condor/GlideinWMS jobs in OSG

Data Size          Data Transfer Method
< 10 MB            Condor's file transfer mechanism
10 MB - 500 MB     Storage Element (SE) / Storage Resource Manager (SRM) interface
> 500 MB           SRM/dCache or pre-staging
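
To make the first row of the table concrete, here is a minimal sketch of a Condor submit description file that ships a small input file using Condor's built-in file transfer mechanism, wrapped in a shell heredoc. The file names (run_search.sh, params.xml) are placeholders I made up for illustration, not the actual proteomics job.

# Sketch: a small (< 10 MB) input shipped with Condor's built-in file transfer.
cat > search.submit <<'EOF'
universe                = vanilla
executable              = run_search.sh
transfer_input_files    = params.xml
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = search_$(Cluster).$(Process).out
error                   = search_$(Cluster).$(Process).err
log                     = search.log
queue
EOF
condor_submit search.submit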

When the number of jobs is large and the data size approaches the upper limit for Condor's internal file transfer, we have found in past work that HTTP file transfer works well: it moves the load of transferring input files and executables away from the GlideinWMS Frontend server. For the proteomics analysis project, the compressed archive of the search library and executables (83 MB) was identical across all jobs, and the input data was identical within each dataset, so we decided to push our HTTP file transfer experience further by adding squid caching. The benefit of caching becomes more evident as more jobs land on compute nodes at a site with a local (site-specific) squid server, up to the capacity limit of the squid server itself.

Every CMS and ATLAS site is required to run a squid server, whose location is published via the environment variable OSG_SQUID_LOCATION. This means that with a very simple wrapper script, a compute node can pull down input files and/or executables using a client tool such as wget or curl and then proceed with the actual computation. The example below shows a bash script that reads the OSG_SQUID_LOCATION environment variable on a compute node and tries to download a file via squid; on failure, the script downloads the file directly from the source. (Ref: https://twiki.grid.iu.edu/bin/view/Documentation/OsgHttpBasics)


#!/bin/sh
website=http://google.com/

# Section A: point wget at the site squid server, if one is advertised
source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi

# Section B: download through the proxy, retrying transient failures
wget --retry-connrefused --waitretry=20 $website

# Section C: if the proxied download failed, fall back to the source
if [ $? -ne 0 ]
then
   unset http_proxy
   wget --retry-connrefused --waitretry=20 $website
   if [ $? -ne 0 ]
   then
      exit 1
   fi
fi

Here is an explanation of the code above:
  • Section A: Check whether the environment variable OSG_SQUID_LOCATION is set; if so, export its value as the environment variable http_proxy, which wget uses as the squid server location.
  • Section B: Download the file using wget. The flag --retry-connrefused treats a refused connection as a transient error and retries, which helps ride out short-term failures. The 20-second wait between retries is specified via --waitretry.
  • Section C: If the download through the squid server fails, unset http_proxy and fetch the file directly from the original HTTP source.
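
Putting the pieces together for this particular workload, a job wrapper might fetch the shared 83 MB archive through the site squid, fall back to the origin server on failure, unpack it, and then run the comparisons. The sketch below follows that pattern; the URL and file names are placeholders I made up, not the production setup.

#!/bin/sh
# Hypothetical wrapper: fetch the shared archive via squid (with fallback), unpack, run.
archive=http://data.example.edu/omssa/omssa_bundle.tar.gz   # placeholder URL

source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi

if ! wget --retry-connrefused --waitretry=20 -O bundle.tar.gz $archive; then
  unset http_proxy
  wget --retry-connrefused --waitretry=20 -O bundle.tar.gz $archive || exit 1
fi

tar xzf bundle.tar.gz        # unpack executables and search libraries
./run_omssa.sh "$@"          # placeholder for the actual OMSSA invocation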

In addition to an OSG site-specific squid server, this type of data transfer requires a reliable HTTP server that can handle download requests from sites where a squid server is unavailable, as well as the requests originating from the squid servers themselves and any failover requests. At UNL we have set up a dedicated HTTP serving infrastructure with load-balanced failover, implemented using Linux Virtual Server (LVS); its implementation details are shown in the diagram below.
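
For readers who haven't used LVS before, the sketch below conveys the general idea with ipvsadm, using made-up addresses: one virtual IP fronts two real HTTP servers and the director balances requests between them. It is only meant to illustrate the concept; our actual configuration differs.

# Sketch only: 192.168.0.100 is an assumed virtual IP; .11 and .12 are the real web servers.
ipvsadm -A -t 192.168.0.100:80 -s wlc                   # create the virtual HTTP service (weighted least-connection)
ipvsadm -a -t 192.168.0.100:80 -r 192.168.0.11:80 -m    # add the first real server (NAT forwarding)
ipvsadm -a -t 192.168.0.100:80 -r 192.168.0.12:80 -m    # add the second real server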





Friday, July 8, 2011

Part III: Bulletproof process tracking with cgroups

Finally, it's time to provide a good solution for process tracking in a Linux batch system.
If you recall, in Part I we surveyed common methods for process tracking and ultimately concluded that batch systems rely on userspace mechanisms (most of which were originally designed for shell-based process control, by the way) that are either unreliable or unable to detect when failures occur.  In Part II, the picture brightened: the kernel can provide an event feed about process births and deaths, and it informs us when messages are dropped.

In this post, we'll talk about a new feature called "cgroups", short for "control groups".  Cgroups are a mechanism in the Linux kernel for managing a set of processes and all their descendants.  They are managed through a filesystem-like interface (in the manner of /proc); the directory structure expresses the fact that they are hierarchical, and filesystem permissions can be used to restrict the set of users allowed to manipulate them.  By default, only root is allowed to manipulate control groups: unlike the process groups, process trees, and environment cookies examined before, a process typically has no ability to change its own group.  Further, unlike the proc connector API, the control group is assigned synchronously by the kernel at process creation time.  Hence, fork-bombs are not an effective way to escape from the group.

While having the tracking done by the kernel is an immense improvement, the true power of cgroups becomes apparent through the use of multiple subsystems.  Different cgroup subsystems can control scheduler policy, allocate or limit resources, or account for usage.

For example, the memory controller can be used to limit the amount of memory used by a set of processes.  This is a huge improvement over the previous memory limit technique (rlimit), where the limit was applied per process.  With rlimit, you could limit a single process to 1 GB, but the job would just spawn N processes of 1 GB each, sidestepping your limit.  In the kernel shipped with Fedora 15, ten controllers are active by default.  For more information, check the kernel documentation; if you are a Red Hat customer, I find the RHEL 6 manual has the best cgroups documentation out there.
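
To make the memory controller concrete, here is a minimal sketch of putting a shell (and everything it subsequently spawns) under a 1 GB group-wide memory cap through the cgroup filesystem interface. The mount point /cgroups/memory and the group name "myjob" are assumptions chosen to match the paths used later in this post.

# Run as root; assumes the memory controller is mounted at /cgroups/memory.
mkdir /cgroups/memory/myjob                                              # create a new cgroup
echo $((1024*1024*1024)) > /cgroups/memory/myjob/memory.limit_in_bytes   # cap the whole group at 1 GB
echo $$ > /cgroups/memory/myjob/tasks                                    # move this shell (and its future children) into it
cat /cgroups/memory/myjob/memory.usage_in_bytes                          # usage summed over every process in the group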

To see cgroups in action, use the systemd-cgls command found on Fedora 15.  This will print out the current hierarchy of all cgroups.  Here's what I see on my system (output truncated for display reasons):

├ condor
│ ├ 17948 /usr/sbin/condor_master -f
│ ├ 17949 condor_collector -f
│ ├ 17950 condor_negotiator -f
│ ├ 17951 condor_schedd -f
│ ├ 17952 condor_startd -f
│ ├ 17953 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 48...
│ └ 18224 condor_procd -A /var/run/condor/procd_pipe.STARTD -R 10000000 -S 60 -C 48...
├ user
│ ├ root
│ │ └ master
│ │   └ 6879 bash
│ └ bbockelm
│   ├ 1168
│   │ ├ 21426 sshd: bbockelm [priv]
│   │ ├ 21429 sshd: bbockelm@pts/3
│   │ ├ 21430 -bash
│   │ └ 21530 systemd-cgls
│   ├ 309
│   │ ├  1110 /usr/libexec/gvfsd-http --spawner :1.4 /org/gtk/gvfs/exec_spaw/0
│   │ ├  6198 gnome-terminal
│   │ ├  6202 gnome-pty-helper 
(output trimmed) 
└ system
  ├ 1 /bin/systemd --log-level info --log-target syslog-or-kmsg --system --dump...
  ├ sendmail.service
  │ ├ 8603 sendmail: accepting connections
  │ └ 8612 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
  ├ auditd.service
  │ ├ 8542 auditd
  │ ├ 8544 /sbin/audispd
  │ └ 8552 /usr/sbin/sedispatch
  ├ sshd.service
  │ └ 7572 /usr/sbin/sshd 
(output trimmed)

All of the processes in my system are in a cgroup somewhere in this tree: all login shells are placed inside a cgroup named /user/$USERNAME; each system service (such as ssh) is located inside a cgroup named /system/$SERVICENAME; and finally, there's a special one named /condor.  More on /condor later.

To see the cgroups for the current process, you can do the following:
[bbockelm@mydesktop ~]$ cat /proc/self/cgroup 
10:blkio:/
9:net_cls:/
8:freezer:/
7:devices:/
6:memory:/
5:cpuacct:/
4:cpu:/
3:ns:/
2:cpuset:/
1:name=systemd:/user/bbockelm/1168
Note that each process is not necessarily in only one cgroup.  The rules are: a process belongs to exactly one cgroup per mount (hierarchy), one or more controllers can be attached to a single mount, and a controller can only be mounted once.
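
As an illustration of those rules, the sketch below sets up two hierarchies by hand: one with the cpu and cpuacct controllers attached together, and one with the memory controller on its own. The mount points are arbitrary names for the example; distributions such as Fedora 15 already do this for you at boot.

# Run as root; mount points are arbitrary examples.
mkdir -p /mnt/cgroup_cpu /mnt/cgroup_mem
mount -t cgroup -o cpu,cpuacct none /mnt/cgroup_cpu    # two controllers sharing one hierarchy
mount -t cgroup -o memory none /mnt/cgroup_mem         # one controller on its own hierarchy
# Every process now has exactly one cgroup in each of the two hierarchies.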

Each controller exposes statistics through files in its cgroup directory.  For example, on Fedora 15, if I want to see how much memory everything in the condor cgroup is using, I can do the following:

[bbockelm@rcf-bockelman ~]$ cat /cgroups/memory/condor/memory.usage_in_bytes 
34365440
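
Along the same lines, the cpuacct controller accumulates the CPU time consumed by everything in a group; a one-line sketch, assuming the same mount convention as above:

cat /cgroups/cpuacct/condor/cpuacct.usage   # total CPU time, in nanoseconds, used by the condor group and its children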

But what about the batch system?
I hope our readers can see the immediate utility of having a simple mechanism for inescapable process tracking.  We examined one such mechanism before (adding a secondary GID per batch job), but it has a small drawback: the secondary GID can be used to create permanent objects (files owned by the secondary GID) which outlive the lifetime of the batch job.

But even in Part I of the series, we concluded that a perfect process tracking mechanism is not enough: we also need to be able to kill processes when the batch job is finished!  The cgroups developers must have come to the same conclusion, as one controller is called the freezer.  The freezer controller simply stops every process in the cgroup from receiving CPU time from the kernel.  All processes in the cgroup are frozen, and there is no way for a process to know it is about to be frozen, as no signal is delivered.  Hence, a process tracker can freeze the processes, send them all SIGKILL, and unfreeze them.  All processes end immediately; none has the chance to hide in /proc or spawn new children in a race condition.
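
Here is a minimal sketch of that freeze-kill-thaw sequence, assuming the freezer controller is mounted at /cgroups/freezer and the job lives in a hypothetical cgroup named condor/job_42_0.

CG=/cgroups/freezer/condor/job_42_0     # hypothetical job cgroup

echo FROZEN > $CG/freezer.state         # ask the kernel to stop scheduling every task in the group
until grep -q FROZEN $CG/freezer.state; do sleep 1; done   # the state reads FREEZING until the freeze completes

for pid in $(cat $CG/tasks); do         # signal every member; frozen tasks cannot react or fork
  kill -KILL $pid
done

echo THAWED > $CG/freezer.state         # resume scheduling so the pending SIGKILLs take effect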

If you look at the first process tree posted, there is a cgroup called "condor".  As I presented at Condor Week 2011, condor is now integrated with cgroups.  It can be started in a cgroup the sysadmin specifies (such as /condor), and it will create a unique cgroup for each job (/cgroup/job_$CLUSTERID_$PROC_ID).  It uses whatever controllers are active on the system to try and track memory consumption, CPU time, and block I/O.  When the job ends or is killed, the freezer controller is used to clean up any processes.

Conclusions
As disparate scientific clusters have become increasingly linked through the use of grids, improved process tracking has become more important.  Many sites have users from across the nation; it's no longer possible for a sysadmin to be good friends with each user.  Some users submit jobs of questionable quality; some show up with virus-ridden laptops.

In the end, traditional process tracking in batch systems is not really ready for modern users.  Most modern batch systems no longer rely solely on the original Unix grouping mechanisms, but they can still be defeated by malicious users.  The problem is not solvable from user space alone.

Luckily, with the proc connector API (for any Linux 2.6 kernel) and cgroups (for recent kernels), we can greatly improve the state of the art.  The community contributing to the Linux kernel is broad, but I understand much of the cgroups work has come from the OpenVZ folks: thanks, guys!

As I've been exploring this subject, I have been implementing cgroup support in Condor: I think it's a great new feature.  It will be released with Condor 7.7.0, due out in a few days.  There's no reason other batch systems can't also adopt cgroups for process tracking: I hope they spread widely in the future!