Thursday, December 29, 2011

What's the hold-up?

Do you have the following diagram memorized?

If your site runs Condor, you probably should.  It shows the states of the condor_startd, the activities within the state, and the transitions between them.  If you want to have jobs reliably pre-empted (or is that killed?  Or vacated?) from the worker node for something like memory usage, a clear understanding is required.

However, the 30 state transitions might be a bit much for some site admins who just want to kill jobs that go over a memory limit.  In such a case, admins can utilize the SYSTEM_PERIODIC_REMOVE or the SYSTEM_PERIODIC_HOLD configuration parameters on the condor_schedd to respectively remove or hold jobs.

These expressions are evaluated periodically against the schedd's copy of each job ClassAd (by default, once every 60 seconds); if an expression evaluates to true for a given job, the job is removed or held.  This almost immediately preempts execution on the worker node.
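
The evaluation interval is itself a schedd configuration knob; the value below is simply the stock Condor default, shown for illustration:

# condor_schedd configuration: seconds between periodic expression sweeps
PERIODIC_EXPR_INTERVAL = 60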

[Note: While effective and simple, these are not the best way to accomplish this sort of policy!  As the worker node may talk to multiple schedds (via flocking, or just through a complex pool with many schedds), it's best to express the node's preferences locally.]

At HCC, the periodic hold and remove policy looks like this:

# hold jobs using absurd amounts of disk (100+ GB) or memory (1.6+ GB)
SYSTEM_PERIODIC_HOLD = \
   (JobStatus == 1 || JobStatus == 2) && ((DiskUsage > 100000000 || ResidentSetSize > 1600000))

# forceful removal of running after 2 days, held jobs after 6 hours,
# and anything trying to run more than 10 times
SYSTEM_PERIODIC_REMOVE = \
   (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*6) || \
   (JobStatus == 2 && CurrentTime - EnteredCurrentStatus > 3600*24*2) || \
   (JobStatus == 5 && JobRunCount >= 10) || \
   (JobStatus == 5 && HoldReasonCode =?= 14 && HoldReasonSubCode =?= 2)

We place anything on hold that goes over some pre-defined resource limit (disk usage or memory usage).  Jobs are removed if they have been on hold for a long time, have run for too long, have restarted too many times, or are missing their input files.

Note that this is a flat policy for the cluster - heterogeneous nodes with large amounts of RAM per core would not be well-utilized.  We could tweak this by having users add a RequestMemory attribute to their job's ad (defaulting to 1.6GB), place into the Requirements that the slot have sufficient memory, and have the node only accept jobs that request memory below a certain threshold.  The expression above could then be tweaked to hold jobs where (ResidentSetSize > RequestMemory).  Perhaps more on that in the future if we go this route.
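
As a rough sketch, such a per-job policy might look like the following (untested; recall that ResidentSetSize is reported in KB while RequestMemory is conventionally given in MB, hence the factor of 1024, and the 1600 default mirrors the 1.6GB figure above):

# hypothetical per-job memory limit - not the policy we currently run
SYSTEM_PERIODIC_HOLD = \
   (JobStatus == 1 || JobStatus == 2) && \
   (DiskUsage > 100000000 || \
    ResidentSetSize > 1024 * ifThenElse(isUndefined(RequestMemory), 1600, RequestMemory))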

While the SYSTEM_PERIODIC_* expressions are useful, Dan Bradley recently introduced me to the SYSTEM_PERIODIC_*_REASON parameters.  These allow you to build a custom hold message for the user whose jobs you're about to interrupt.  The expression is evaluated within the context of the job's ad, and the resulting string is placed in the job's HoldReason attribute.  As an example, previously, the hold message was something bland and generic:

The SYSTEM_PERIODIC_HOLD  expression evaluated to true.

Why did it evaluate to true?  Was it memory or disk usage?  When it was held, how bad was the disk/memory usage?  These things can get lost in the system.  Oops.  We added the following to our schedd's configuration:

# Report why the job went on hold.
SYSTEM_PERIODIC_HOLD_REASON = \
   strcat("Job in status ", JobStatus, \
   " put on hold by SYSTEM_PERIODIC_HOLD due to ", \
   ifThenElse(isUndefined(DiskUsage) || DiskUsage < 100000000, \
      strcat("memory usage ", ResidentSetSize), \
      strcat("disk usage ", DiskUsage)), ".")

Now, we have beautiful error messages in the user's logs explaining the issue:

Job in status 2 put on hold by SYSTEM_PERIODIC_HOLD due to memory usage 1620340.

One less thing to get confused about!

Friday, December 23, 2011

A simple iRODS Micro-Service

Introduction

The goal I had for this task was to identify and understand the steps and configurations involved in writing a micro-service and seeing it in action - for details regarding iRODS, please refer to the documentation at https://www.iRODS.org/.  The micro-service that I wrote is very simplistic (it writes a hello-world message to the system log); however, it serves its purpose by providing an overview of the steps involved in writing a useful micro-service.

Before I document the configuration and code involved in creating and registering the new micro-service, let’s look at Figure 1.




Figure 1 shows a high-level view of the invocation of a micro-service by the iRODS rule engine. One way of looking at the micro-service and the iRODS rule engine is to think of them as an event-based triggering system that can perform 'operations' on data objects and/or external resources. Micro-services are registered in iRODS rule definitions, and the rule engine invokes them based on the condition specified for that rule. For a list of places in the iRODS workflow where a micro-service may be triggered, please visit: https://www.irods.org/index.php/Default_iRODS_Rules.

Also you may refer to https://www.iRODS.org/index.php/Rule_Engine for a detailed diagram of a micro-service invocation.


Figure 2 above shows the communication between the iRODS rule engine and a micro-service. A simplistic view of the communication layers is that the rule engine calls a defined C procedure, which exposes its functionality through an interface (commonly prefixed with msi). The arguments to the procedure are passed through a structure named msParam_t that is defined below:

typedef struct MsParam {
  char *label;
  char *type;         /* this is the name of the packing instruction in
                       * rodsPackTable.h */
  void *inOutStruct;
  bytesBuf_t *inpOutBuf;
} msParam_t;


Writing the micro-service

Figure 3 shows the steps involved in creating a new micro-service:

Write the C procedure

The C code below (let's call it test.c) has a function writemessage that writes a message to the system log. There is an interface to the function named msiWritemessage which exposes the writemessage function. The msi function takes a list of arguments of type msParam_t and a final argument of type ruleExecInfo_t for the result of the operation.

#include <stdio.h>
#include <unistd.h>
#include <syslog.h>
#include <string.h>
#include "apiHeaderAll.h"


void writemessage(char arg1[], char arg2[]);
int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,  ruleExecInfo_t *rei);


void writemessage(char arg1[], char arg2[]) {
    openlog("slog", LOG_PID|LOG_CONS, LOG_USER);
    syslog(LOG_INFO, "%s %s from micro-service", arg1, arg2);
    closelog();
}

int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,  ruleExecInfo_t *rei)
{
 char *in1 = NULL;
 char *in2 = NULL;
 RE_TEST_MACRO ("    Calling Procedure");
 /* the above line is needed for loop-back testing using the irule -i option */
 if ( strcmp( mParg1->type, STR_MS_T ) == 0 )
 {
    in1 = (char*) mParg1->inOutStruct;
 }
 if ( strcmp( mParg2->type, STR_MS_T ) == 0 )
 {
    in2 = (char*) mParg2->inOutStruct;
 }
 /* only log if both arguments arrived as strings */
 if ( in1 != NULL && in2 != NULL )
 {
    writemessage(in1, in2);
 }
 return rei->status;
}



Next I will make a folder structure in the modules folder of the iRODS home for placing this micro-service, and copy a few files from the example properties module, modifying them to fit the test.c micro-service:

cd ~irods
mkdir modules/HCC
cd modules/HCC

mkdir microservices
mkdir rules
mkdir lib
mkdir clients
mkdir servers

mkdir microservices/src
mkdir microservices/include
mkdir microservices/obj
cp ../properties/Makefile .
cp ../properties/info.txt .



Listed below are my working copies of the Makefile and info.txt:

#Makefile
ifndef buildDir
buildDir = $(CURDIR)/../..
endif

include $(buildDir)/config/config.mk
include $(buildDir)/config/platform.mk
include $(buildDir)/config/directories.mk
include $(buildDir)/config/common.mk

#
# Directories
#
MSObjDir =    $(modulesDir)/HCC/microservices/obj
MSSrcDir =    $(modulesDir)/HCC/microservices/src
MSIncDir =    $(modulesDir)/HCC/microservices/include

# Source files

OBJECTS =    $(MSObjDir)/test.o


# Compile and link flags
#
INCLUDES +=    $(INCLUDE_FLAGS) $(LIB_INCLUDES) $(SVR_INCLUDES)
CFLAGS_OPTIONS := $(CFLAGS) $(MY_CFLAG)
CFLAGS =    $(CFLAGS_OPTIONS) $(INCLUDES) $(MODULE_CFLAGS)

.PHONY: all server client microservices clean
.PHONY: server_ldflags client_ldflags server_cflags client_cflags
.PHONY: print_cflags

# Build everything
all:    microservices
    @true

# List module's objects and needed libs for inclusion in clients
client_ldflags:
    @true

# List module's includes for inclusion in the clients
client_cflags:
    @true

# List module's objects and needed libs for inclusion in the server
server_ldflags:
    @echo $(OBJECTS) $(LIBS)

# List module's includes for inclusion in the server
server_cflags:
    @echo $(INCLUDE_FLAGS)

# Build microservices
microservices:    print_cflags $(OBJECTS)

# Build client additions
client:
    @true

# Build server additions
server:
    @true

# Build rules
rules:
    @true

# Clean
clean:
    @echo "Clean image module..."
    rm -rf $(MSObjDir)/*.o


# Show compile flags
print_cflags:
    @echo "Compile flags:"
    @echo "    $(CFLAGS_OPTIONS)"

# Compile targets
#
$(OBJECTS): $(MSObjDir)/%.o: $(MSSrcDir)/%.c $(DEPEND)
    @echo "Compile image module `basename $@`..."
    @$(CC) -c $(CFLAGS) -o $@ $<


info.txt


Name:        HCC
Brief:        HCC Test microservice
Description:    HCC Test microservice.
Dependencies:
Enabled:    yes
Creator:    Ashu Guru
Created:    December 2011
License:    BSD


In the next step I will define the micro-service header and micro-service table files so that iRODS can be configured with the new micro-service. This is done in the folder microservices/include. In this example there is no header for this code, so I have left the header file blank; in the micro-service table file I have the entry for the table definition.  The specifics to note below are that the first argument is the label of the micro-service, the second argument is the count of input arguments (do not count the ruleExecInfo_t argument) of the msi interface, and the third argument is the name of the msi interface function.

File microservices/include/microservices.table

{ "msiWritemessage",2,(funcPtr) msiWritemessage },   


Following is the directory tree structure for the HCC module that I have so far:
bash-4.1$ pwd 
/opt/iRODS/modules
bash-4.1$ tree HCC
HCC
├── clients
├── info.txt
├── lib
├── Makefile
├── microservices
│   ├── include
│   │   ├── microservices.header
│   │   ├── microservices.table
│   ├── obj
│   └── src
│       ├── test.c
├── rules
└── servers

Next I will make an entry for enabling the new module (this micro-service); this is done in the file ~irods/config/config.mk so that the iRODS Makefile includes the new micro-service in the build.  To do this, simply add the module folder name (in my case HCC) to the MODULES variable, as sketched below.
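
For example, the relevant line in config.mk might end up looking something like this (the "properties" entry is only illustrative - keep whatever modules are already enabled in your copy):

# ~irods/config/config.mk
MODULES= properties HCC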

Compile and test

cd ~irods/modules/<YOURMODULENAME>
make

The above commands should result in the creation of an object file in the microservices/obj folder.  I am going to test the micro-service manually first; to accomplish this I will create a client-side rule file in the folder ~irods/clients/icommands/test/rules.  I have named the file aguru.ir, and the following are its contents:

aguruTest||msiWritemessage(*A,*B)|nop
*A=helloworld%*B=testing
 
The first line in the file is the rule definition and the second line contains the input parameters.  To test the micro-service I will invoke it with irule, which will then write a message to the system log (see the example invocation below).
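
The manual invocation looks something like this (the relative path to the icommands binaries may differ on your install):

cd ~irods/clients/icommands/test/rules
../../bin/irule -vF aguru.ir
# then, on the iRODS server, check the system log:
tail /var/log/messages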


Recompile iRODS

Before this step I must make the entries for the headers and the msi table in the iRODS main micro-service action table (i.e. file ~irods/server/re/include/reAction.h). This should be done using the following commands:

rm server/re/include/reAction.h
make reaction 


However, I had to manually add the declaration below to the server/re/include/reAction.h file to accomplish that:

int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,  ruleExecInfo_t *rei);

Finally, recompile iRODS:

cd ~irods
make test_flags
make modules
./irodsctl stop
make clean
make
./irodsctl start
./irodsctl status

Register Micro-service and Test

In this step we define a rule that will trigger the micro-service when a new data object is uploaded to iRODS. Open the file ~irods/server/config/reConfigs/core.re and add the following line to the Test Rules section.

acPostProcForPut {msiWritemessage("HelloWorld","String 2"); }

That is it… now if I put (iput) any file into iRODS, a message is added to the /var/log/messages file on the iRODS server.  Please note that the above rule does not filter on any particular condition but is a catch-all rule that applies to all put events.

References:
https://www.irods.org/
http://www.wrg.york.ac.uk/iread/compiling-and-running-irods-with-micros-services
http://technical.bestgrid.org/index.php/IRODS_deployment_plan

Thursday, December 15, 2011

How to create an OpenStack controller

As before, the "official" instructions on which our procedure is based are here:


http://docs.openstack.org/cactus/openstack-compute/admin/content/installing-openstack-compute-on-rhel6.html


First setup the repository:


wget http://yum.griddynamics.net/yum/cactus/openstack/openstack-repo-2011.2-1.el6.noarch.rpm
rpm -ivh openstack-repo-2011.2-1.el6.noarch.rpm


Then install OpenStack and its dependencies:

yum install libvirt
chkconfig libvirtd on
/etc/init.d/libvirtd start
yum install euca2ools openstack-nova-{api,compute,network,objectstore,scheduler,volume} openstack-nova-cc-config openstack-glance


Start services:

service mysqld start
chkconfig mysqld on
service rabbitmq-server start
chkconfig rabbitmq-server on


Set up the database authorisations. First, set the root password:

mysqladmin -uroot password

Now, to automate the procedure create an executable shell script

openstack-db-setup.sh

with the following content (fill in the relevant user name and password fields as well as the IPs):

#!/bin/bash

DB_NAME=nova
DB_USER=
DB_PASS=
PWD=        # the MySQL root password set with mysqladmin above

#CC_HOST="A.B.C.D" # IPv4 address
CC_HOST="130.199.148.53" # IPv4 address, fill your own
#HOSTS='node1 node2 node3' # compute nodes list
HOSTS='130.199.148.54' # compute nodes list, fill your own

mysqladmin -uroot -p$PWD -f drop nova
mysqladmin -uroot -p$PWD create nova

for h in $HOSTS localhost; do
# connect as root (password $PWD) and grant the nova user access from each host
echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO '$DB_USER'@'$h' IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p$PWD mysql
done
echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO $DB_USER IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p$PWD mysql
echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO root IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p$PWD mysql

And now execute this script:

./openstack-db-setup.sh


Create db schema

nova-manage db sync


Now comes a point which is not in the "official" instructions. The installation will not work unless you patch the nova network Python code:

patch -p0 < rhel6-nova-network-patch.diff


Create logical volumes:

lvcreate -L 1G --name test nova-volumes
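
Note that this assumes an LVM volume group named nova-volumes already exists on the controller.  If it does not, it can be created along these lines (the device name /dev/sdb is only an example - use whatever spare disk or partition your host has):

pvcreate /dev/sdb
vgcreate nova-volumes /dev/sdb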


For your convenience, create an OpenStack startup shell script, openstack-init.sh.

Here is its content:

#!/bin/bash
# file: openstack-init.sh - pass start/stop/restart through to all Nova services plus Glance
for n in api compute network objectstore scheduler volume; do
    service openstack-nova-$n "$@"
done
service openstack-glance-api "$@"

And finally we are ready to start openstack:

./openstack-init.sh start

With fingers crossed, you should get:

Starting OpenStack Nova API Server: [ OK ]
Starting OpenStack Nova Compute Worker: [ OK ]
Starting OpenStack Nova Network Controller: [ OK ]
Starting OpenStack Nova Object Storage: [ OK ]
Starting OpenStack Nova Scheduler: [ OK ]
Starting OpenStack Nova Volume Worker: [ OK ]
Starting OpenStack Glance API Server: [ OK ]


Now we need to configure and customize the installation which is another story for another day...

How to create an OpenStack worker node

The "official" instructions how to install openstack components are located here:

http://docs.openstack.org/cactus/openstack-compute/admin/content/installing-openstack-compute-on-rhel6.html

Unfortunately, they are not very clear and miss some key points.  Below is a summary of our installation procedure.

First of all, let us install the worker node.

wget http://yum.griddynamics.net/yum/cactus/openstack/openstack-repo-2011.2-1.el6.noarch.rpm
rpm -ivh openstack-repo-2011.2-1.el6.noarch.rpm
yum install libvirt
chkconfig libvirtd on
/etc/init.d/libvirtd start
yum install openstack-nova-compute openstack-nova-compute-config
service openstack-nova-compute start

If everything goes fine you should see

Starting OpenStack Nova Compute Worker: [ OK ]

Thursday, December 8, 2011

Network Accounting for Condor

It's been a long time since the August post describing how to set up manual network accounting for a process.  We now have a solution integrated into Condor and available on github.  It takes a bit of effort to understand how it works, so I've put together a series of diagrams to illustrate it.

First, we start off with the lowly condor_starter on any worker node with a network connection (to simplify things, I didn't draw the other condor processes involved):
By default, all processes on the node are in the same network namespace (labelled the "System Network Namespace" in this diagram).  We denote the network interface with a box, and assume it has address 192.168.0.1.

Next, the starter will create a pair of virtual ethernet devices.  We will refer to them as pipe devices, because any byte written into one will come out of the other - just as a venerable Unix pipe works:
By default, the network pipes are in a down state and have no IP address associated with them.  Not very useful!  At this point, we have some decisions to make: how should the network pipe device be presented to the network?  Should it be networked at layer 3, using NAT to route packets?  Or should we bridge it at layer 2, allowing the device to have a public IP address?

Really, it's up to the site, but we assume most sites will want to take the NAT approach: the public IP address might seem useful, but would require a public IP for each job.  To allow customization, all the routing is done by a helper script, and we provide a default implementation for NAT (a rough sketch follows the list below).  The script:
  • Takes two arguments, a unique "job identifier" and the name of the network pipe device.
  • Is responsible for setting up any routing required for the device.
  • Must create an iptables chain using the same name of the "job identifier".
    • Each rule in the chain will record the number of bytes matched; at the end of the job, these will be reported in the job ClassAd using an attribute name identical to the comment on the rule.
  • On stdout, returns the IP address the internal network pipe should use.
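
To make that contract concrete, here is a minimal sketch of what such a NAT helper script could look like; the interface name, campus network range, and internal address are assumptions for illustration, and the default script shipped with the patch differs in its details:

#!/bin/bash
# Hypothetical NAT setup helper: $1 = job identifier, $2 = external network pipe device.
JOBID=$1
DEV=$2
EXTIF=eth0                   # assumed physical interface
CAMPUS=129.93.0.0/16         # assumed on-campus network
INTERNAL_IP=192.168.181.2    # assumed address for the job's side of the pipe

# Basic NAT routing for the job's traffic.
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $EXTIF -j MASQUERADE

# Per-job accounting chain named after the job identifier; each rule's comment
# becomes a Network* attribute in the job ClassAd at the end of the job.
iptables -N $JOBID
iptables -A FORWARD -i $DEV -j $JOBID
iptables -A FORWARD -o $DEV -j $JOBID
iptables -A $JOBID -i $DEV -o $EXTIF -d $CAMPUS -m comment --comment "OutgoingInternal" -j ACCEPT
iptables -A $JOBID -i $DEV -o $EXTIF ! -d $CAMPUS -m comment --comment "OutgoingExternal" -j ACCEPT
iptables -A $JOBID -i $EXTIF -o $DEV -s $CAMPUS -m state --state RELATED,ESTABLISHED -m comment --comment "IncomingInternal" -j ACCEPT
iptables -A $JOBID -i $EXTIF -o $DEV ! -s $CAMPUS -m state --state RELATED,ESTABLISHED -m comment --comment "IncomingExternal" -j ACCEPT

# Report the IP address the internal network pipe should use on stdout.
echo $INTERNAL_IP
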
Additionally, Condor provides a cleanup script that does the inverse of the setup script.  The result looks something like this:
Next, the starter forks a separate process in a new network namespace using the clone() call with the CLONE_NEWNET flag.  Notice that, by default, no network devices are accessible in the new namespace:
Next, the external starter will pass one side of the pipe to the other namespace; the internal starter will do some minimal configuration of the device (default route, IP address, set the device to the "up" status):
Finally, the starter exec's to the job.  Whenever the job does any network operations, the bytes are routed via the internal network pipe, come out the external network pipe, and then are NAT'd to the physical network device before exiting the machine.

As mentioned, the whole point of the exercise is to do network accounting.  Since all packets go through one device, Condor can read out all the activity via iptables.  The "helper script" above will create a unique chain per job.  This allows some level of flexibility; for example, the chain below allows us to distinguish between on-campus and off-campus packets:

Chain JOB_12345 (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 ACCEPT     all  --  veth0  em1     anywhere             129.93.0.0/16        /* OutgoingInternal */
    0     0 ACCEPT     all  --  veth0  em1     anywhere            !129.93.0.0/16        /* OutgoingExternal */
    0     0 ACCEPT     all  --  em1    veth0   129.93.0.0/16        anywhere             state RELATED,ESTABLISHED /* IncomingInternal */
    0     0 ACCEPT     all  --  em1    veth0  !129.93.0.0/16        anywhere             state RELATED,ESTABLISHED /* IncomingExternal */
    0     0 REJECT     all  --  any    any     anywhere             anywhere             reject-with icmp-port-unreachable

Thus, the resulting ClassAd history from this job will have an attribute for NetworkOutgoingInternal, NetworkOutgoingExternal, NetworkIncomingInternal, and NetworkIncomingExternal.  We have an updated Condor Gratia probe that looks for Network* attributes and reports them appropriately to the accounting database.

Thus, we have byte-level network accounting, allowing us to answer the age-old question of "how much would a CMS T2 cost on Amazon EC2?"  Or perhaps we could answer "how much is a currently running job going to cost me?"  Matt has pointed out the network setup callout could be used to implement security zones, isolating (or QoS'ing) jobs of certain users at the network level.  There are quite a few possibilities!

We'll definitely be returning to this work mid-2012 when the local T2 is based on SL6, and this patch can be put into production.  There will be some further engagement with the Condor team to see if they're interested in taking the patch.  The Gratia probe work to manage network information will be interesting upstream too.  Finally, I encourage interested readers to take a look at the github branch.  The patch itself is a tour-de-force of several dark corners of Linux systems programming (it involves using clone, synchronizing processes with pipes, sending messages to the kernel via netlink to configure the routing, and reading out iptables configurations using C).  It was very rewarding to implement!


Thursday, December 1, 2011

Details on glexec improvements

My last blog post gave a quick overview of why glexec exists, what issues folks run into, and what we did to improve it.  Let's go into some details.

How Condor Update Works
The lcmaps-plugin-condor-update package contains the modules necessary to advertise the payload certificate of the last glexec invocation in the pilot's ClassAd.  The concept is simple - the implementation is a bit tricky.

Condor has had a command-line tool called condor_advertise for a while; it allows an admin to hand-advertise updates to ads in the collector.  Unfortunately, that's not quite what we need here: we want to update the job ad in the schedd, while condor_advertise typically updates the machine ad in the collector.  Close, but no cigar.

There's a lesser-known utility called condor_chirp that we can use.  Typically, condor_chirp is used to do I/O between the schedd and the starter (for example, you can pull/push files on demand in the middle of the job), but it can also update the job's ad in the schedd.  The syntax is simple:

condor_chirp set_job_attr ATTR_NAME ATTR_VAL

(look at the clever things Matt does with condor_chirp).  As condor_chirp allows additional access to the schedd, the user must explicitly request it in the job ad.  If you want to try it out, you must add the following line into your submit file:

+WantIOProxy=TRUE

To work, chirp must know how to contact the starter and have access to the "magic cookie"; these are located inside the $_CONDOR_SCRATCH_DIR, as set by Condor in the initial batch process.  As the glexec plugin runs as root (glexec must be setuid root to launch a process as a different UID), we must guard against being fooled by the invoking user.
Accordingly, the plugin uses /proc to read the parentage of the process tree until it finds a process owned by root.  If this is not init, it is assumed the process is the condor_starter, and the job's $_CONDOR_SCRATCH_DIR can be deduced from the $CWD and the PID of the starter.  Since we only rely on information from root-owned processes, we can be fairly sure this is the correct scratch directory.  As a further safeguard, before invoking condor_chirp, the plugin drops privilege to that of the invoking user.  Along with the other security guarantees provided by glexec, we have confidence that we are reading the correct chirp configuration and are not allowing the invoker to increase its privileges.
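
The idea of that parentage walk, sketched very roughly in shell (illustration only - the real plugin does this in C and runs as root, which is also what lets it resolve the starter's working directory):

#!/bin/bash
# Climb the process tree until we find a root-owned ancestor.
pid=$$
while [ "$pid" -gt 1 ]; do
    owner=$(stat -c %u /proc/$pid)        # UID owning this process
    [ "$owner" -eq 0 ] && break
    pid=$(awk '/^PPid:/ {print $2}' /proc/$pid/status)
done
echo "First root-owned ancestor: $pid"
# If $pid is not 1 (init), assume it is the condor_starter; its current working
# directory points at (or near) the job's scratch directory.  Requires root to read:
readlink /proc/$pid/cwd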

Once we know how to invoke condor_chirp, the rest of the process is all downhill.  glexec internally knows the payload's DN, the payload Unix user, and does the equivalent of the following:

condor_chirp set_job_attr glexec_user "hcc"
condor_chirp set_job_attr glexec_x509userproxysubject "/DC=org/DC=cilogon/C=US/O=University of Nebraska-Lincoln/CN=Brian Bockelman A621"
condor_chirp set_job_attr glexec_time 1322761868

condor_chirp writes the data into the starter, which then updates the shadow, then the schedd (some of the gory details are covered in the Condor wiki).

The diagram below illustrates the data flow:


Putting this into Play
If you really want to get messy, you can check out the source code from Subversion at:
svn://t2.unl.edu/brian/lcmaps-plugins-condor-update
(web view)

The current version of the plugin is 0.0.2.  It's available in Koji, or via yum in the osg-development repository:

yum install --enablerepo=osg-development lcmaps-plugins-condor-update

(you must already have the osg-release RPM installed and glexec otherwise configured).

After installing it, you need to update the /etc/lcmaps.db configuration file on the worker node to invoke the condor-update module.  In the top half, I add:

condor_updates = "lcmaps_condor_update.mod"

Then, I add condor-update to the glexec policy:

glexec:

verifyproxy -> gumsclient
gumsclient -> condor_updates
condor_updates -> tracking

Note we use the "tracking" module locally; most sites will use the "glexec-tracking" module.  Pick the appropriate one.

Finally, you need to turn on the I/O proxy in the Condor submit file.  We do this by editing condor.pm  (for RPMs, located in /usr/lib/perl5/vendor_perl/5.8.8/Globus/GRAM/JobManager/condor.pm).  We add the following line into the submit routine, right before queue is added to the script file:

print SCRIPT_FILE "+WantIOProxy=TRUE\n";

All new incoming jobs will get this attribute; any glexec invocations they do will be reflected at the CE!

GUMS and Worker Node Certificates
To map a certificate to a Unix user, glexec calls out to the GUMS server using XACML with a grid-interoperable profile.  In the XACML callout, GUMS is given the payload's DN and VOMS attributes.  The same library (LCMAPS/SCAS-client) and protocol can also make callouts directly to SCAS, more commonly used in Europe.

GUMS is a powerful and flexible authorization tool; one feature is that it allows different mappings based on the originating hostname.  For example, if desired, my certificate could map to user hcc at red.unl.edu but map to cmsprod at ff-grid.unl.edu.  To prevent "just anyone" from probing the GUMS server, GUMS requires the client to present an X509 certificate (in this case, the hostcert); it takes the hostname from the client's certificate.

This has the unfortunate side-effect of requiring a host certificate on every node that invokes GUMS; OK for the CE (100 in the OSG), but not for glexec on the worker nodes (thousands on the OSG).

When glexec is invoked in EGI, SCAS is invoked using the pilot certificate for HTTPS and information about the payload certificate in the XACML callout; this requires no worker node host certificate.

To replicate how glexec works in EGI, we had to develop a small patch to GUMS.  When the pilot certificate is used for authentication, the pilot's DN is recorded to the logs (so we know who is invoking GUMS), but the host name is self-reported in the XACML callout.  As the authentication is still performed, we believe this relaxing of the security model is acceptable.

A patched, working version of GUMS can be found in Koji and is available in the osg-development repository.  It will still be a few months before the RPM-based GUMS install is fully documented and released, however.

Once installed, two changes need to be made at the server:

  • Do all hostname mappings based on "DN" in the web interface, not the "CN".
  • Any group of users (for example, /cms/Role=pilot) that want to invoke GUMS must have "read all" access, not just "read self".
Further, /etc/lcmaps.db needs to be changed to remove the following lines from the gumsclient module:

"-cert   /etc/grid-security/hostcert.pem"
"-key    /etc/grid-security/hostkey.pem"
"--cert-owner root"

This will all be automated going forward - and it should help remove some of the pain of deploying glexec!

Friday, November 11, 2011

Improving the glexec-enabled life


Pilot-based workflow management systems have dramatically transformed how we view the grid today.  Instead of queueing a job (the "payload") in a workflow onto a site on a grid, these systems send an "empty" job that starts up, then downloads and starts the payload from a central endpoint.  In CS terms, it switches from a model of "work delegation" to "resource allocation".  By allocating the resource (i.e., starting the pilot job) prior to delegating work, users no longer have to know the vagaries/failure modes of direct grid submission and don't have to pay the price of sending their payloads to a busy site!

In short, pilot jobs make the grid much better.

However, like most concepts, pilot jobs are a trade-off: they make life easier for users, but harder for security folks and sysadmins.  Pilots are sent using one certificate, but payloads are run under a different identity.  If the payload job wants to act on behalf of the user, it needs to bring the user's grid credentials to the worker node.  [Side note: this is actually an interesting assumption.  The PanDA pilot system, heavily utilized by ATLAS, does not bring credentials to the worker node.  This simplifies this problem, but opens up a different set of concerns.]  If both pilot and payload are run as the same Unix user, the payload user can easily access the credentials (including the pilot credentials), executables, and output data of other running payloads.

The program glexec is a "simple" idea to solve this problem: given a set of grid credentials, launch a process under the corresponding Unix account at the site.  For example, with credentials from the HCC VO:

[bbockelm@brian-test ~]$ whoami
bbockelm
[bbockelm@brian-test ~]$ GLEXEC_CLIENT_CERT=/tmp/x509up_u1221 /usr/sbin/glexec /usr/bin/whoami
hcc

(You'll notice the invocation is not as simple as typing "glexec whoami"; it's not exactly designed for end-user invocation).  To achieve the user switching, glexec has to be setuid root.  Setuid binaries must be examined under a security microscope, which has unfortunately led to slow adoption of glexec.

The idea is that pilot jobs would wrap the payload with a call to "glexec", separating the payload from the pilot and other payloads.  From there, it goes horribly wrong.  Not wrong really - but rather things get sticky.

Since the pilot and payload are both low-privileged users, the pilot doesn't have permission to clean up or kill the payload.  It must again use glexec to send signals and delete sandboxes.  The several invocations are easy to screw up (and place load on the authorization system!).  There are tricky error conditions - if authorization breaks in the middle of the job, how does the pilot clean up the payload?

As the payload is a full-fledged Linux process, it can create other processes, daemonize, escape from the batch system, etc.  As previously discussed, the batch system - with root access - typically does a poor job tracking processes.  The pilot will be hopeless unless we provide some assistance.

Glexec imposes an integration difficulty at some sites.  There are popular cron scripts that kill processes belonging to users on a node who aren't currently running batch system jobs.  So, if the pilot maps to "cms" and the payload maps to "cmsuser", the batch system only knows about "cms", and the cronjob will kill all processes belonging to "cmsuser".  We lost quite a few jobs at some sites before we figured this out!

Site admins manage the cluster via the batch system.  Since the payload is invisible to the batch system, we're unable to kill jobs from a user with batch system tools (condor_rm, qdel).  In fact, if we get an email from a user asking for help understanding their jobs, we can't even easily find where the job is running!  Site admins have to ssh into each worker node and examine the running jobs; a process that is simply medieval.

Finally, on the OSG, invoking the authorization system requires host certificate credentials.  This is not a problem when host certs are needed for a handful of CEs at the site, but explodes when glexec is run on each worker node.  This is a piece of unique state on the worker nodes for sites to manage, adding to the glexec headache.

We're the Government.  We're here to help.

The OSG Technology group has decided to tackle the three biggest site-admin usability issues in glexec:

  1. Batch system integration: The Condor batch system provides the ability for running jobs to update the submit node with arbitrary status.  We have developed a plugin that updates the job's ClassAd with the payload's DN whenever glexec is invoked.
  2. Process tracking: There is an existing glexec plugin to do process tracking.  However, this requires an admin to set up secondary GID ranges (an administration headache) and suffers the previously-documented process tracking issues.  We will port the ProcPolice daemon over to the glexec plugin framework.
  3. Worker node certificates: We propose to fix this via improvements to GUMS, allowing the mappings to be performed based on the presence of "Role=pilot" VOMS extension in the pilot certificate.
The plugins in (1) and (2) have been prototyped, and are available in the osg-development repository as "lcmaps-plugins-condor-update" and "lcmaps-plugins-process-tracking", respectively.  The third item is currently cooking.
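
Both prototype plugins can be installed from the osg-development repository in the usual way (assuming the osg-release RPM is already configured); for example:

yum install --enablerepo=osg-development lcmaps-plugins-condor-update lcmaps-plugins-process-tracking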

The "lcmaps-plugins-condor-update" is especially useful, as it's a brand-new capability as opposed to an improvement.  It  advertises three attributes in the job's ClassAd:
  • glexec_x509userproxysubject: The DN of the payload user.
  • glexec_user: The Unix username for the payload.
  • glexec_time: The Unix time when glexec was invoked.
We can then use it to filter and locate jobs.  For example, if a user named Ian complains his jobs are running slowly, we could locate a few with the following command:

[bbockelm@t3-sl5 ~]$ condor_q -g -const 'regexp("Ian", glexec_x509userproxysubject)' -format '%s ' ClusterId -format '%s\n' RemoteHost | head
868341 slot6@red-d11n10.red.hcc.unl.edu
868343 slot7@node238.red.hcc.unl.edu
868358 slot6@red-d11n9.red.hcc.unl.edu
868366 slot2@node239.red.hcc.unl.edu
868373 slot3@node119.red.hcc.unl.edu
868741 slot8@red-d9n6.red.hcc.unl.edu
868770 slot3@red-d9n8.red.hcc.unl.edu
868819 slot5@node109.red.hcc.unl.edu
868820 slot4@node246.red.hcc.unl.edu
868849 slot2@red-d11n6.red.hcc.unl.edu

Slick!

Wednesday, October 19, 2011

KVM and Condor (Part 2): Condor configuration for VM Universe & VM Image Staging

This is Part 2 of my previous blog, KVM and Condor (Part 1): Creating the virtual machine.  In this blog I will share the steps for configuring the Condor VM universe; in addition, I will also discuss the steps involved in staging the VM disk images.  It is assumed that you have a basic Condor setup working and that there is a shared file system accessible from each of the worker nodes.

As a first step, please make sure that the worker nodes support KVM-based virtualization; if they do not, you may use:

yum groupinstall "KVM"
yum -y install kvm libvirt libvirt-python python-virtinst libvirt-client

Configuring Condor for KVM

For Condor to support the VM universe, the following attributes must be set in the Condor configuration of each of the worker nodes (this may be done by modifying the local Condor config file):


VM_GAHP_SERVER = $(SBIN)/condor_vm-gahp
VM_GAHP_LOG = $(LOG)/VMGahpLog
VM_MEMORY = 5000
VM_TYPE = kvm
VM_NETWORKING = true
VM_NETWORKING_TYPE = nat
ENABLE_URL_TRANSFERS = TRUE
FILETRANSFER_PLUGINS = /usr/local/bin/vm-nfs-plugin

The explanation of the above attributes follows:

  • VM_GAHP_SERVER: The complete path and file name of the condor_vm-gahp.
  • VM_GAHP_LOG: The complete path and file name of the condor_vm-gahp log.
  • VM_MEMORY: A VM universe job is required to specify its memory needs with vm_memory (Mbytes) in its job description file.  On the worker node, the value of the VM_MEMORY configuration is used for matching the memory requested by the job.  VM_MEMORY is an integer value that specifies the maximum amount of memory in Mbytes that will be allowed for the virtual machine program.
  • VM_TYPE: Can have the value kvm, xen, or vmware, and specifies the type of supported virtual machine software.
  • VM_NETWORKING: Must be set to true to support networking in the VM instances.
  • VM_NETWORKING_TYPE: A string value describing the type of networking.
  • ENABLE_URL_TRANSFERS: A Boolean value which, when True, causes the condor_starter for a job to invoke all plug-ins defined by FILETRANSFER_PLUGINS when a file transfer is specified with a URL in the job description file.
  • FILETRANSFER_PLUGINS: A comma-separated list of absolute paths to plug-in executable(s) that accomplish the task of file transfer when a job requests the transfer of an input file by specifying a URL.

The File Transfer Plugin

So far we have modified the configuration of the Condor worker node to support the VM universe. Next I will describe a barebones FILETRANSFER_PLUGINS executable.  I will use bash for scripting, and the plugin will reside at /usr/local/bin/vm-nfs-plugin on each of the worker nodes.

#!/bin/bash
#file: /usr/local/bin/vm-nfs-plugin
#----------------------------------------
# Plugin Essential
if [ "$1" = "-classad" ]
then
   echo "PluginVersion = \"0.1\""
   echo "PluginType = \"FileTransfer\""
   echo "SupportedMethods = \"nfs\""
   exit 0
fi

#----------------------------------------
# Variable definitions
# transferInputstr_format='nfs:<abs path to (nfs hosted) inputfile file>:<basename of vminstance file>'
WHICHQEMUIMG='/usr/bin/qemu-img'
initdir=$PWD
transferInputstr=$1
#-------------------------------------------
# Split the first argument to an array
IFS=':' read -ra transferInputarray <<< "$transferInputstr"
#-------------------------------------------
#create the vm instance copy on write
$WHICHQEMUIMG create -b ${transferInputarray[1]} -f  qcow2   ${initdir}/${transferInputarray[2]}
exit 0; 

          
Overall, the idea behind the above script is to create a qcow2-formatted VM instance file in the Condor-allocated execute folder.  The details of the code blocks above are listed below:

The “# Plugin Essential” part of the code is a requirement for a Condor file transfer plug-in, so that a plug-in can be registered appropriately to handle file transfers based on the methods (protocols) it supports. The condor_starter daemon invokes each plug-in with the command line argument ‘-classad’ to identify the protocols that the plug-in supports; it expects that the plug-in will respond with an output of three ClassAd attributes. The first two are fixed: PluginVersion = "0.1" and PluginType = "FileTransfer"; the third is the ClassAd attribute ‘SupportedMethods’, a string value containing a comma separated list of the protocols that the plug-in handles. Thus, in the script above SupportedMethods = "nfs" identifies that the plug-in vm-nfs-plugin supports a user defined protocol ‘nfs’. Accordingly, the ‘nfs’ string will be matched to the protocol specification as given within a URL in the transfer_input_files command in a Condor job description file.

For a file transfer invocation a plug-in is invoked with two arguments - the first being the URL specified in the job description file, and the second being the absolute path identifying where to place the transferred file.  The plug-in is expected to transfer the file and exit with a status of 0 when the transfer is successful.  A non-zero status must be returned when the transfer is unsuccessful; for an unsuccessful transfer the job is placed on hold, and the job ClassAd attribute HoldReason is set with a message, along with HoldReasonSubCode, which is set to the exit status of the plug-in.

In the bash code above I am only using the first argument received by the plugin.  Further, it is decided that the value of transfer_input_files will follow the format given in the script comment for the variable transferInputstr_format, i.e. 'nfs:<abs path to (nfs hosted) inputfile file>:<basename of vminstance file>'.  Thus, after splitting the first argument received by the plugin, the plug-in creates a qcow2 image with a backing file based on the original template.

Now, once we send a Condor reconfig (condor_reconfig) to the worker node or restart the Condor service (service condor restart) on the worker nodes, the plug-in is ready to be used; an example submit file is shown below.

Example Job Description

#Condor job description file
universe=vm
vm_type=kvm
executable=agurutest_vm
vm_networking=true
vm_no_output_vm=true
vm_memory=1536
#Point to the nfs location that will be available from worker node
transfer_input_files=nfs://<path to the vm image>:vmimage.img
vm_disk="vmimage.img:hda:rw"
requirements= (TARGET.FileSystemDomain =!= FALSE) && ( TARGET.VM_Type == "kvm" ) && ( TARGET.VM_AvailNum > 0 ) && ( VM_Memory >= 0 ) 
log=test.log
queue 1

This submit file should invoke the vm-nfs-plugin, and a VM instance should start on a worker node.  You can check on the VM by opening a shell on the worker node and using the virsh utility, for example as shown below.
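
For instance, something along these lines (run as root on the worker node; the instance name reported by virsh will vary):

virsh list --all                 # list running (and defined) VM instances
virsh console <instance-name>    # attach to the instance's console, if one is configured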

That is all for this blog; in Part 3, the last part of this series, I will write about using a file transfer plugin with the Storage Resource Manager (SRM).

Thursday, September 8, 2011

Per-Batch Job Network Statistics

Introduction

The OSG takes a fairly abstract definition of a cloud:

A cloud is a service that provisions resources on-demand for a marginal cost

The two important pieces of this definition are "resource provisioning" and "marginal cost".  The most common cloud instance you'll run into is Amazon EC2, which provisions VMs; depending on the size of VM, the marginal cost is between $0.03 and $0.80 an hour.

The EC2 charge model is actually more complicated than just VMs-per-hour.  There are additional charges for storage and network use.  In controlled experiments last year, CMS determined the largest cost of using EC2 was not the CPU time, but the network usage.

This showed a glaring hole in OSG's current accounting: we only record wall and CPU time.  For the other metrics - which can't be estimated accurately by looking at wall time - we are blind.

Long story short - if OSG ever wants to provide a cloud service using our batch systems, we need better accounting.

Hence, we are running a technology investigation to bring batch system accounting up to par with EC2's: https://jira.opensciencegrid.org/browse/TECHNOLOGY-2

Our current target is to provide a proof-of-concept using Condor.  With Condor 7.7.0's cgroup integration, the CPU/memory usage is very accurate, but network accounting for vanilla jobs is missing.  Network accounting is the topic for this post; we have the following goals:

  • The accounting should be done for all processes spawned during the batch job.
  • All network traffic should be included.
  • Separately account LAN traffic from WAN traffic (in EC2, these have different costs).
The Woes of Linux Network Accounting

The state of Linux network accounting, well, sucks (for our purposes!).  Here's a few ways to tackle it, and why each of them won't work:

  • Counting packets through an interface: If you assume that there is only one job per host, you can count the packets that go through a network interface.  This is a big, currently unlikely, assumption.
  • Per-process accounting: There exists a kernel patch floating around on the internet that adds per-process in/out statistics.  However, other than polling frequently, we have no mechanism to account for short-lived processes.  Besides, asking folks to run custom kernels is a good way to get ignored.
  • cgroups: There is a net controller in cgroups.  This marks packets in such a way that they can be manipulated by the tc utility.  tc controls the layer of buffering before packets are transferred to the network card and can do accounting.  Unfortunately:
    • In RHEL6, there's no way to persist tc rules.
    • This only accounts for outgoing packets; incoming packets do not pass through.
    • We cannot distinguish between local network traffic and off-campus network traffic.  This can actually be overcome with a technique similar in difficulty to byte packet filters (BPF), but would be difficult.
  • ptrace or dynamic loader techniques: There exist libraries (exemplified by parrot) that provide a mechanism for intercepting calls.  We could instrument this.  However, this path is notoriously buggy and difficult to maintain: it would require a lot of code, and would not work for statically-compiled processes.
The most full-featured network accounting is in the routing code controlled by iptables.  In particular, this can account for incoming and outgoing traffic, plus differentiate between on-campus and off-campus traffic.

We're going to tackle the problem using iptables; the trick is going to be distinguishing all the traffic from a single batch job.  As in the previous series on managing batch system processes, we are going to borrow heavily from techniques used in Linux containers.


Per-Batch Job Network Statistics

To get perfect per-batch-job network statistics that differentiate between local and remote traffic, we will combine iptables, NAT, virtual ethernet devices, and network namespaces.  It will be somewhat of a tour-de-force of Linux kernel networking - and it is currently very manual.  Automation is still forthcoming.

This recipe is a synthesis of the ideas presented in the following pages:

We'll be thinking of the batch job as a "semi-container": it will get its own network device like a container, but have more visibility to the OS than in a container.  To follow this recipe, you'll need RHEL6 or later.


First, we'll create a pair of ethernet devices and set up NAT-based routing between them and the rest of the OS.  We will assume eth0 is the outgoing network device and that the IPs 192.168.0.1 and 192.168.0.2 are currently not routed in the network.

  1. Enable IP forwarding:
    echo 1 > /proc/sys/net/ipv4/ip_forward
  2. Create an veth ethernet device pair:
    ip link add type veth
    This will create two devices, veth0 and veth1, that act similar to a Unix pipe: bytes sent to veth1 will be received by veth0 (and vice versa).
  3. Assign IPs to the new veth devices; we will use 192.168.0.1 and 192.168.0.2:
    ifconfig veth0 192.168.0.1/24 up
    ifconfig veth1 192.168.0.2/24 up
  4. Download and compile ns_exec.c; this is a handy utility developed by IBM that allows us to create processes in new namespaces.  Compilation can be done like this:
    gcc -o ns_exec ns_exec.c
    This requires a RHEL6 kernel and the kernel headers
  5. In a separate window, launch a new shell in a new network and mount namespace:
    ./ns_exec -nm -- /bin/bash
    We'll refer to this as shell 2 and our original window as shell 1.
  6. Use ps to determine the pid of shell 2.  In shell 1, execute:
    ip link set veth1 netns $PID_OF_SHELL_2
    In shell 2, you should be able to run ifconfig and see veth1.
  7. In shell 2, re-mount the /sys filesystem and enable the loopback device:
    mount -t sysfs none /sys
    ifconfig lo up
At this point, we have a "batch job" (shell 2) with its own dedicated networking device.  All traffic generated by this process - or its children - must pass through here.  Traffic generated in shell 2 will go into veth1 and out veth0.  However, we haven't hooked up the routing for veth0, so packets currently stop there; fairly useless.

Next, we create a NAT between veth0 and eth0.  This is a point of divergence - alternately, we could bridge the networks at layer 2 or layer 3 and provide the job with its own public IP.  I'll leave that as an exercise for the reader.  For the NAT, I will assume that 129.93.0.0/16 is the on-campus network and everything else is off-campus.  Everything will be done in shell 1:
  1. Verify that any firewall won't be blocking NAT packets.  If you don't know how to do that, turn off the firewall with
    iptables -F
    If you want a firewall, but don't know how iptables works, then you probably want to spend a few hours learning first.
  2. Enable the packet mangling for NAT:
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
  3. Forward packets from veth0 to eth0, using separate rules for on/off campus:
    iptables -A FORWARD -i veth0 -o eth0 --dst 129.93.0.0/16 -j ACCEPT
    iptables -A FORWARD -i veth0 -o eth0 ! --dst 129.93.0.0/16 -j ACCEPT
  4. Forward TCP connections from eth0 to veth0 using separate rules:
    iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED --src 129.93.0.0/16 -j ACCEPT
    iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED ! --src 129.93.0.0/16 -j ACCEPT
At this point, you can switch back to shell 2 and verify the network is working.  iptables will automatically do accounting; you just need to enable command line flags to get it printed:
iptables -L -n -v -x
If you look at the network accounting reference, they show how to separate all the accounting rules into a separate chain.  This allows you to, for example, reset counters for only the traffic accounting.  On my example host, the output looks like this:
Chain INPUT (policy ACCEPT 4 packets, 524 bytes)
    pkts      bytes target     prot opt in     out     source               destination        

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination        
      30     1570 ACCEPT     all  --  veth0  eth0    0.0.0.0/0            129.93.0.0/16      
      18     1025 ACCEPT     all  --  veth0  eth0    0.0.0.0/0           !129.93.0.0/16      
      28    26759 ACCEPT     all  --  eth0   veth0   129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED
      17    10573 ACCEPT     all  --  eth0   veth0  !129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED

Chain OUTPUT (policy ACCEPT 4 packets, 276 bytes)
    pkts      bytes target     prot opt in     out     source               destination        

As you can see, my "job" has downloaded about 26KB from on-campus and 10KB from off-campus.

Voila!  Network accounting appropriate for a batch system!

Friday, August 26, 2011

Creating a VM for OpenStack

Intro

Here at HCC, we have a few VM-based projects going.  One is the Condor-based VM launching that Ashu referenced in his previous posting.  That project is to take an existing capability (Condor batch system hooked to the grid) and extend it; instead of launching processes, one can launch an entire VM.

One of our other employees, Josh, has been working from the other direction: taking a common "cloud platform", OpenStack, and seeing if it can be adapted to our high-throughput needs.  The OpenStack work is in its beginning phases, but bits and pieces are starting to become functional.

Last night, I tried out the install for the first time.  One of the initial tasks I wanted to accomplish was to create a custom VM.  A lot of the OpenStack documentation is fairly Ubuntu-specific, so I've taken their pages and adapted them for installing from a CentOS 5.6 machine.  Unfortunately, I didn't take any nice screen shots like Ashu did, but I hope this will be useful to others.

Long term, we plan to open OpenStack up to select OSG VOs for testing. While we are still in the "tear it down and rebuild once a week" mode, it's just been opened up to select HCC users.

So, without further ado, I present...


Creating a new Fedora image using HCC's OpenStack

These notes are based on the upstream openstack documents here:

http://docs.openstack.org/trunk/openstack-compute/admin/content/creating-a-linux-image.html

Prerequisites

It all starts with an account.

For local users, contact hcc-support to get your access credentials.  They will come in a zipfile.  Download the zipfile into your home directory and unpack it.  Among other things, there will be a novarc file.  Source this:

source novarc

This will set up environment variables in your shell pointing to your login credentials. Do not share these with other people! You will need to do this each time you open a new shell.

To create the image, you will need root access on a development machine with KVM installed.  I used a CentOS 5.6 machine and did:

yum groupinstall kvm

to get the various necessary KVM packages.

First, create a new raw image file:

qemu-img create -f raw /tmp/server.img 5G

This will be the block device that is presented to your virtual machine; make it as large as necessary. Our current hardware is pretty space-limited: smaller is encouraged. Next, download the Fedora boot ISO:

curl http://serverbeach1.fedoraproject.org/pub/alt/bfo/bfo.iso > /tmp/bfo.iso

This is a small, 670KB ISO file that contains just enough information to bootstrap the Anaconda installer.  Next, we'll boot it as a virtual machine on your local system.

sudo /usr/libexec/qemu-kvm -m 2048 -cdrom /tmp/bfo.iso -drive file=/tmp/server.img -boot d -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize

This will create a simple virtual machine (2 cores, 2GB RAM) with /tmp/server.img as a drive, and boot the machine from /tmp/bfo.iso.  It will also allow you to connect to the VM via a VNC viewer.

If you are physically on the host machine, you can use a VNC viewer for screen ":0".  If you are logged in remotely (I log in from my Mac), you'll want to port-forward:

ssh -L 5900:localhost:5900 username@remotemachine.example.com

From your laptop, connect to localhost:0 with a VNC viewer.  Note that the most common VNC viewers on the Mac (the built-in Remote Viewer and Chicken of the VNC) don't work with KVM.  I found that "JollyFastVNC" works, but costs $5 from the App Store.

Once logged in, select the version of Fedora you'd like to install, and "click next" until the installation is done.  Fedora 15 is sure nice :)

Fedora will want to reboot the machine, but the reboot will fail because KVM is set to only boot from the CD.  So, once it tries to reboot, kill KVM and start it again with the following arguments:

sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/server.img -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize

Again, connect via VNC, and do any post-install customization.  Start by updating and turning on SSH:

yum update
yum install openssh-server
chkconfig sshd on

You will need to tweak /etc/fstab to make it suitable for a cloud instance.  Nova-compute may resize the disk at the time of launch of instances based on the instance type chosen. This can make the UUID of the disk invalid.  Further, we will remove the LVM setup, and just have the root partition present (no swap, no /boot).

Edit /etc/fstab (in the VM's filesystem).  Change the following three lines:

/dev/mapper/VolGroup-lv_root /                       ext4    defaults        1 1
UUID=0abae194-64c8-4d13-a4c0-6284d9dcd7b4 /boot                   ext4    defaults        1 2
/dev/mapper/VolGroup-lv_swap swap                    swap    defaults        0 0

to just one line:

LABEL=uec-rootfs              /          ext4           defaults     0    0

Since Fedora does not ship with an init script for OpenStack, we will do a nasty hack to pull the correct SSH key at boot. Edit the /etc/rc.local file and add the following lines before the line "touch /var/lock/subsys/local":

depmod -a
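# load PCI hotplug support (acpiphp) so that volumes attached to the running instance show up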
modprobe acpiphp

# simple attempt to get the user ssh key using the meta-data service
mkdir -p /root/.ssh
echo >> /root/.ssh/authorized_keys
curl -m 10 -s http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key | grep 'ssh-rsa' >> /root/.ssh/authorized_keys
echo "AUTHORIZED_KEYS:"
echo "************************"
cat /root/.ssh/authorized_keys
echo "************************"

Once you are finished customizing, go ahead and power off:

poweroff

Converting to an acceptable OpenStack format

The image uploaded to OpenStack needs to be an ext4 filesystem image; we currently have a raw block-device image.  We will extract the filesystem by running a few commands on the host machine.  First, we need to find out the starting sector of the partition. Run:

fdisk -ul /tmp/server.img

You should see an output like this (the error messages are harmless):

last_lba(): I don't know how to handle files with mode 81a4
You must set cylinders.
You can do this from the extra functions menu.

Disk /dev/loop0: 5368 MB, 5368709120 bytes
255 heads, 63 sectors/track, 652 cylinders, total 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes

      Device Boot      Start         End      Blocks   Id  System
/dev/loop0p1   *        2048     1026047      512000   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/loop0p2         1026048    10485759     4729856   8e  Linux LVM
Partition 2 does not end on cylinder boundary.

Note that the following commands assume the units are 512-byte sectors.  You will need the start and end numbers for the "Linux LVM" partition; in this case, they are 1026048 and 10485759.
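
If you would rather not copy the numbers by hand, a small sketch like the following can pull them out of fdisk (it assumes the output format shown above and that the LVM partition is the one with Id 8e):

# Extract the start/end sectors of the LVM partition (Id 8e) from fdisk's listing
START=$(/sbin/fdisk -ul /tmp/server.img 2>/dev/null | awk '$5 == "8e" {print $2}')
END=$(/sbin/fdisk -ul /tmp/server.img 2>/dev/null | awk '$5 == "8e" {print $3}')
echo "start=$START end=$END sectors=$((END - START + 1))"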

Copy the entire partition to a new file:

dd if=/tmp/server.img of=/tmp/server.lvm.img skip=1026048 count=$((10485759-1026048+1)) bs=512

For "skip" and "count", use the begin and end you copy/pasted from the fdisk output. Now we have our LVM image; we'll need to activate it.  First, mount the LVM image on the loopback device and look for the volume group name:

[bbockelm@localhost ~]$ sudo /sbin/losetup /dev/loop0 /tmp/server.lvm.img
[bbockelm@localhost ~]$ sudo /sbin/pvscan
  PV /dev/sdb1    VG vg_home     lvm2 [7.20 TB / 0    free]
  PV /dev/sda2    VG vg_system   lvm2 [73.88 GB / 0    free]
  PV /dev/loop0   VG VolGroup    lvm2 [4.50 GB / 0    free]
  Total: 3 [1.28 TB] / in use: 3 [1.28 TB] / in no VG: 0 [0   ]

Note the third listing is for our loopback device (/dev/loop0) and a volume group named, simply, "VolGroup".  We'll want to activate that:

[bbockelm@localhost ~]$ sudo /sbin/vgchange -ay VolGroup
  2 logical volume(s) in volume group "VolGroup" now active

We can now see the Fedora root file system in /dev/VolGroup/lv_root.  We use dd to make a copy of this disk:

sudo dd if=/dev/VolGroup/lv_root of=/tmp/serverfinal.img

I get the following output:

[bbockelm@localhost ~]$ sudo dd if=/dev/VolGroup/lv_root of=/tmp/serverfinal2.img
3145728+0 records in
3145728+0 records out
1610612736 bytes (1.6 GB) copied, 14.5444 seconds, 111 MB/s

It's time to unmount all our devices.  Start by removing the LVM:

[bbockelm@localhost ~]$ sudo /sbin/vgchange -an VolGroup
  0 logical volume(s) in volume group "VolGroup" now active

Then, detach our loopback device:

[bbockelm@localhost ~]$ sudo /sbin/losetup -d /dev/loop0

We will do one last tweak: change the label on our filesystem image to "uec-rootfs":

sudo /sbin/tune2fs -L uec-rootfs /tmp/serverfinal.img

*Note* that your filesystem image is ext4; if your host is RHEL5.x (this is my case!), your version of tune2fs will not be able to complete this operation.  In this case, you will need to restart your VM in KVM with the newly-extracted serverfinal.img as a second hard drive.  I did the following KVM invocation:

sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/server.img -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize -drive file=/tmp/serverfinal.img

The second drive shows up as /dev/sdb; go ahead and re-execute tune2fs from within the VM:

[root@localhost ~]# tune2fs -L uec-rootfs /dev/sdb
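
As an optional sanity check (assuming e2fsprogs inside the VM understands ext4, which it does on Fedora 15), you can verify that the new label took:

e2label /dev/sdb    # should print: uec-rootfs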

Extract Kernel and Initrd for OpenStack

Fedora creates a small boot partition separate from the LVM we extracted previously.  We'll need to mount it, and copy out the kernel and initrd.  First, attach the image to a loopback device and map its partitions.

[bbockelm@localhost ~]$ sudo /sbin/losetup -f /tmp/server.img
[bbockelm@localhost ~]$ sudo /sbin/kpartx -a /dev/loop0

The boot partition should now be available at /dev/mapper/loop0p1.  Mount this:

[bbockelm@localhost ~]$ sudo mkdir  /tmp/server_image/
[bbockelm@localhost ~]$ sudo mount /dev/mapper/loop0p1  /tmp/server_image/

Now, copy out the kernel and initrd:

[bbockelm@localhost ~]$ cp /tmp/server_image/vmlinuz-2.6.40.3-0.fc15.x86_64 ~
[bbockelm@localhost ~]$ cp /tmp/server_image/initramfs-2.6.40.3-0.fc15.x86_64.img ~

Unmount and unmap:

[bbockelm@localhost ~]$ sudo umount /tmp/server_image
[bbockelm@localhost ~]$ sudo /sbin/kpartx -d /dev/loop0
[bbockelm@localhost ~]$ sudo /sbin/losetup -d /dev/loop0

Upload into OpenStack

We need to bundle, upload, and register the kernel, the initrd, and finally the disk image.  First, the kernel:

[bbockelm@localhost ~]$ euca-bundle-image -i ~/vmlinuz-2.6.40.3-0.fc15.x86_64 --kernel true
Checking image
Encrypting image
Splitting image...
Part: vmlinuz-2.6.40.3-0.fc15.x86_64.part.00
Generating manifest /tmp/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
[bbockelm@localhost ~]$ euca-upload-bundle -b testbucket -m /tmp/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
Checking bucket: testbucket
Uploading manifest file
Uploading part: vmlinuz-2.6.40.3-0.fc15.x86_64.part.00
Uploaded image as testbucket/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
[bbockelm@localhost ~]$ euca-register testbucket/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
IMAGE	aki-0000000a

Write down the kernel ID; it is aki-0000000a above. Then, the initrd:

euca-bundle-image -i ~/initramfs-2.6.40.3-0.fc15.x86_64.img --ramdisk true
euca-upload-bundle -b testbucket -m /tmp/initramfs-2.6.40.3-0.fc15.x86_64.img.manifest.xml
euca-register testbucket/initramfs-2.6.40.3-0.fc15.x86_64.img.manifest.xml

My initrd's ID was ari-0000000b. Finally, the disk image itself:

euca-bundle-image --kernel aki-0000000a --ramdisk ari-0000000b -i /tmp/serverfinal.img -r x86_64

This will save the manifest into /tmp, named "serverfinal.img.manifest.xml". I didn't particularly care for the name, so I changed it to "fedora-15.img.manifest.xml".  Now, upload and register:

euca-upload-bundle -b testbucket -m /tmp/fedora-15.img.manifest.xml
euca-register testbucket/fedora-15.img.manifest.xml

Congratulations! You now have a brand-new Fedora-15 image ready to use. Fire up HybridFox and see if you were successful.
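
If you would rather test from the command line instead of HybridFox, a launch along these lines should work (a sketch; the ami ID, keypair name, and instance type are placeholders -- use the image ID that euca-register printed for you):

euca-add-keypair mykey > mykey.priv && chmod 600 mykey.priv
euca-run-instances ami-00000010 -k mykey -t m1.small    # ami ID from the euca-register output
euca-describe-instances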

Thursday, August 18, 2011

KVM and Condor (Part 1): Creating the virtual machine.

My next topic, which will span two posts, is launching a Virtual Machine (VM) in a Condor environment.  In this first post I will share the steps I took to create a VM that I will later launch as a job in Condor.

I will be using the Kernel-based Virtual Machine (KVM) implementation for Linux guests.  KVM is a full virtualization framework that can run multiple unmodified guests, including various flavors of Microsoft Windows, Linux, and other UNIX-family operating systems. To see the guest operating systems and platforms that KVM supports, take a look at http://www.linux-kvm.org/page/Guest_Support_Status

Let's get started. For this post, the host system I am working on runs CentOS 6.0 with Linux 2.6.32 on an x86_64 platform.  I will be creating a CentOS 5.6 image for the VM guest.  As a first step, I will get the host system ready with the KVM tools and other dependencies. This requires a package called kvm, which includes the KVM kernel module. In addition to the kvm package, I will be using three tools (viz. virt-install, virsh, and virt-viewer) from the libvirt toolkit. Libvirt (http://libvirt.org/) is a hypervisor-independent API that can interact with the virtualization capabilities of various operating systems. The commands below show how to use yum to install the kvm and libvirt-related packages:

yum install kvm

yum install virt-manager libvirt libvirt-python python-virtinst libvirt-client
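
Before creating the VM, a quick optional sanity check that the kvm module is loaded and libvirtd is running can save some head-scratching; something like:

lsmod | grep kvm                    # should show kvm plus kvm_intel or kvm_amd
service libvirtd start              # make sure the libvirt daemon is running
virsh -c qemu:///system version     # confirms virsh can talk to the hypervisor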

I am now ready to create the VM by using the following command:

virt-install \
  --name=vm56-25GB \
  --disk path=/home/aguru/myvms/vm5.6-25GB.img,sparse=true,size=25 \
  --ram=2048 \
  --location=http://mirror.unl.edu/centos/5.6/os/x86_64/ \
  --os-type=linux \
  --vnc



In the above snippet, 'virt-install' is a libvirt command-line tool for provisioning new virtual machines. The options used above are explained below:
--name is the name of the new machine being created
--disk specifies the absolute path of the virtual machine image (file) that will be created. The 'sparse' option on the same line means the host does not have to allocate all the space up-front, and 'size' gives the size of the VM's hard disk drive in GB
--ram is the amount of RAM for the guest, in MB
--location provides a network-install location where the OS install files for the guest are located
--os-type specifies the type of guest operating system
--vnc sets up a graphical console in the guest and exports it as a VNC server on the host

Unless some dependencies or tools somehow failed to install correctly, your install should start with a new VNC window popping up on your display.  A few screen captures of what you may see are shown below.

** Just a quick note - to release the mouse cursor from the VNC window you can use Ctrl-Alt.

[Installer screenshots 1 through 5, and so on through the final screen (14).]
On the final screen of installation you can click the 'Reboot' button from the VM window to restart the guest VM.

A few basic commands to list, start, and stop a VM

virsh list --all



The output of virsh list --all shows the defined VMs and their current state; for example, a typical output may look like:


Id Name                 State
----------------------------------
- vm56-15KSGB          shut off
- vm56-25GB            shut off



In order to start a VM from the shut off state, issue a virsh start command. Note below that virsh list --all now shows an Id and the running state of the VM (vm56-15KSGB):

virsh start vm56-15KSGB

virsh list --all
Id Name                 State
----------------------------------
1 vm56-15KSGB          running
- vm56-25GB            shut off 
 

To launch a VNC window displaying the console of a running VM, you can use virt-viewer, e.g.

virt-viewer  1


And finally, to shut down a running VM, use virsh shutdown, or force it off with virsh destroy, e.g.
 
virsh shutdown 1
or
virsh destroy 1


Both virt-viewer and virsh shutdown take the Id of the running VM as an argument.

What if I have a Kickstart file for the VM I want to create?

In case you have a Kickstart file that you would like to use for creating the VM, you may use the following command:

virt-install \
  --name=vm56-15KSGB \
  --disk path=/home/aguru/myvms/vm56-15KSGB.img,sparse=true,size=15 \
  --ram=2048 \
  --location=http://newman.ultralight.org/os/centos/5.5/x86_64 \
  --os-type=linux \
  --vnc \
  -x "ks=http://httpdserver.hosting.kickstart/pathto.kickstart.file"

The only addition in this virt-install command, compared to its previous use in this post, is the extra flag '-x'. The value passed with the -x flag points to the web location of the kickstart file.
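
For reference, a kickstart file for an unattended CentOS 5 install might look roughly like the following (purely illustrative -- not the file used above; adjust partitioning, passwords, and packages to taste):

install
url --url http://newman.ultralight.org/os/centos/5.5/x86_64
lang en_US.UTF-8
keyboard us
network --bootproto dhcp
rootpw changeme
authconfig --enableshadow --enablemd5
selinux --enforcing
timezone America/Chicago
bootloader --location=mbr
clearpart --all --initlabel
autopart
reboot

%packages
@core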

That is all for this post. In the next post I will talk about using this newly created image and launching it in the Condor VM Universe.