A few months back I assisted a research group from the University of Nebraska Medical Center (UNMC) in deploying a search for mass spectrometry-based proteomics analysis. The search was performed with a program called the Open Mass Spectrometry Search Algorithm (OMSSA) on the Open Science Grid (OSG) via a GlideinWMS Frontend. In this blog I will talk about the motivation for, and use of, HTTP file transfer along with Squid caching for the input data and executable files of jobs deployed over the OSG. I will also show a basic example of using Squid in the OSG environment.
While working with the UNMC research group, and after looking at the OMSSA specifications and documentation, we identified the following characteristics of the computation and the data handling requirements for the proteomics analysis:
• A total of 45 datasets, each about 21 MB in size.
• 22,000 comparisons/searches (short jobs) per dataset.
• The executables and search libraries for the comparisons add up to 83 MB as a compressed archive.
Based on the above requirements and a few additional tests, it was determined that the workload is well suited to the OSG via GlideinWMS. It was also decided that each GlideinWMS job would contain about 172 comparisons, which works out to a total of about 5,756 individual jobs (22,000 × 45 / 172).
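As an illustration, the snippet below sketches one way to split a dataset's comparison list into chunks of 172 for submission. The file names (`comparisons.txt`, the `job_chunk_` prefix) are hypothetical and not the actual UNMC setup.

```bash
#!/bin/sh
# Hypothetical sketch: group one dataset's ~22,000 comparisons into
# chunks of 172, one chunk per GlideinWMS job.
COMPARISONS_PER_JOB=172

# comparisons.txt: one comparison/search specification per line (assumed format)
split -l $COMPARISONS_PER_JOB -d comparisons.txt job_chunk_

# Each job_chunk_NN file now describes the work for one job.
# ~128 chunks per dataset x 45 datasets gives roughly the ~5,756 jobs above.
ls job_chunk_* | wc -l
```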
Data in the Open Science Grid has always been more difficult to handle than computation, and the challenges grow as either the number of jobs or the data size increases. Various methods exist to overcome and simplify these challenges. Table 1 below shows a rule of thumb that I generally follow to identify the best mode of data transfer for jobs in the OSG environment. Each data transfer method in Table 1 has its own advantages: Condor's internal file transfer is a built-in mechanism, so no extra scripting is required (a minimal submit-file sketch follows the table); SRM can handle large data stores and large individual transfers; and pre-staging can distribute the load of pulling down data.
Table 1: Rule of thumb for choosing a data transfer method.

| Data Size | Data Transfer Method |
|---|---|
| < 10 MB | Condor's File Transfer Mechanism |
| 10 MB - 500 MB | Storage Element (SE)/Storage Resource Manager (SRM) interface |
| > 500 MB | SRM/dCache or Pre-staging |
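For reference, here is a minimal sketch of a job that relies on Condor's built-in file transfer. The executable and input file names are made up for illustration and are not the actual OMSSA job files.

```bash
#!/bin/sh
# Sketch only: generate and submit a job that uses Condor's built-in
# file transfer (file names below are illustrative assumptions).
cat > omssa_job.submit <<'EOF'
universe                = vanilla
executable              = run_omssa.sh
transfer_input_files    = dataset.mgf, omssa_bundle.tar.gz
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output = job_$(Process).out
error  = job_$(Process).err
log    = job.log
queue 1
EOF
condor_submit omssa_job.submit
```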
When the number of jobs is large and the data size reaches the upper limits of Condor's internal file transfer, we have found from past experience that HTTP file transfer works well, since it shifts the load of transferring input files and executables away from the GlideinWMS Frontend server. For the proteomics analysis project, the compressed archive of the search library and executables (83 MB) was the same across all jobs, and the input data was the same within each dataset, so we decided to push our HTTP file transfer experience further by adding Squid caching. The advantage of caching becomes more evident as more jobs land on compute nodes at a site with a local (site-specific) Squid server, up to the limits of the Squid server itself.
Every CMS and ATLAS site is required to run a Squid server, whose location is available via the environment variable OSG_SQUID_LOCATION. This means that, with a very simple wrapper script on a compute node, one can pull down input files and/or executables using a client tool such as wget or curl (a curl-based variant is sketched after the explanation below) and then proceed with the actual computation. The example below shows a bash script that reads the OSG_SQUID_LOCATION environment variable on a compute node and tries to download a file via Squid; on failure, the script downloads the file directly from the source. (Ref: https://twiki.grid.iu.edu/bin/view/Documentation/OsgHttpBasics)
```bash
#!/bin/sh
website=http://google.com/

# Section A
source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
    export http_proxy=$OSG_SQUID_LOCATION
fi

# Section B
wget --retry-connrefused --waitretry=20 $website

# Section C
# Check if the download worked
if [ $? -ne 0 ]
then
    unset http_proxy
    wget --retry-connrefused --waitretry=20 $website
    if [ $? -ne 0 ]
    then
        exit 1
    fi
fi
```
The sections of the above script are explained below:
- Section A: Check whether the environment variable OSG_SQUID_LOCATION is set; if so, export its value as the environment variable http_proxy, which wget uses to locate the Squid server.
- Section B: Download the file using wget. The flag --retry-connrefused treats a refused connection as a transient error and retries, which helps handle short-term failures; the wait of 20 seconds between retries is specified via --waitretry.
- Section C: If the download through the Squid server fails, unset http_proxy and fetch the file directly from the original HTTP source.
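Since curl was mentioned above as an alternative client, here is a sketch of the same try-Squid-then-fail-over logic written with curl. The placeholder URL, output file name, and retry settings are illustrative assumptions rather than values from the original setup.

```bash
#!/bin/sh
# Sketch: same Squid-first, direct-fallback download using curl
website=http://google.com/
source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}

proxy_opt=""
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
    proxy_opt="--proxy $OSG_SQUID_LOCATION"
fi

# Try through the site Squid first; --retry handles transient failures
if ! curl $proxy_opt --retry 3 --retry-delay 20 -f -o downloaded_file "$website"; then
    # Fall back to a direct download from the origin server
    curl --retry 3 --retry-delay 20 -f -o downloaded_file "$website" || exit 1
fi
```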
In addition to the availability of an OSG site-specific Squid server, this type of data transfer requires a reliable HTTP server that can handle download requests from sites where a Squid server is unavailable, as well as requests originating from the Squid servers and any failover requests. At UNL we have set up a dedicated HTTP serving infrastructure with load-balanced failover, implemented using Linux Virtual Server; its implementation details are shown in the diagram below.
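To give a flavor of what such a setup involves, here is an illustrative Linux Virtual Server sketch (not the actual UNL configuration); the virtual and real-server IP addresses are made-up example values.

```bash
#!/bin/sh
# Illustrative only: an LVS director balancing HTTP requests across
# two backend web servers (addresses are example values).
VIP=192.0.2.10

# Define the virtual HTTP service with round-robin scheduling
ipvsadm -A -t $VIP:80 -s rr

# Attach two real servers using direct routing
ipvsadm -a -t $VIP:80 -r 192.0.2.21:80 -g
ipvsadm -a -t $VIP:80 -r 192.0.2.22:80 -g

# Verify the resulting service table
ipvsadm -L -n
```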
You can see more detailed examples of squid usage at https://twiki.grid.iu.edu/bin/view/Documentation/OsgHttpBasics
There is also an excellent presentation by Derek Weitzel available at http://docs.google.com/viewer?url=https%3A%2F%2Ftwiki.grid.iu.edu%2Ftwiki%2Fpub%2FCampusGrids%2FApr27%252c2011%2FCampusGridSquid.pdf&embedded=true