Wednesday, October 19, 2011

KVM and Condor (Part 2): Condor configuration for VM Universe & VM Image Staging

This is Part 2 of my previous blog, KVM and Condor (Part 1): Creating the virtual machine. In this post I will share the steps for configuring the Condor VM universe, and I will also discuss the steps involved in staging the VM disk images. It is assumed that you have a basic Condor setup working and that there is a shared file system accessible from each of the worker nodes.

As a first step, please make sure that the worker nodes support KVM-based virtualization. If they do not, you can install the necessary packages with:

yum groupinstall "KVM"
yum -y install kvm libvirt libvirt-python python-virtinst libvirt-client

Configuring Condor for KVM

For Condor to support the VM universe, the following attributes must be set in the Condor configuration of each of the worker nodes (this may be done by modifying the local Condor config file):


VM_GAHP_SERVER = $(SBIN)/condor_vm-gahp
VM_GAHP_LOG = $(LOG)/VMGahpLog
VM_MEMORY = 5000
VM_TYPE = kvm
VM_NETWORKING = true
VM_NETWORKING_TYPE = nat
ENABLE_URL_TRANSFERS = TRUE
FILETRANSFER_PLUGINS = /usr/local/bin/vm-nfs-plugin

The explanation of the above attributes follows:

VM_GAHP_SERVER: The complete path and file name of the condor_vm-gahp.
VM_GAHP_LOG: The complete path and file name of the condor_vm-gahp log.
VM_MEMORY: A VM universe job is required to specify its memory needs with vm_memory (in Mbytes) in its job description file. On the worker node, the value of VM_MEMORY is used for matching against the memory requested by the job; it is an integer specifying the maximum amount of memory (in Mbytes) allowed for the virtual machine program.
VM_TYPE: The type of supported virtual machine software: kvm, xen, or vmware.
VM_NETWORKING: Must be set to true to support networking in the VM instances.
VM_NETWORKING_TYPE: A string value describing the type of networking.
ENABLE_URL_TRANSFERS: A Boolean value which, when True, causes the condor_starter for a job to invoke the plug-ins defined by FILETRANSFER_PLUGINS whenever a file transfer is specified with a URL in the job description file.
FILETRANSFER_PLUGINS: A comma-separated list of absolute paths to plug-in executables that accomplish the task of file transfer when a job requests the transfer of an input file by specifying a URL.

The File Transfer Plugin

So far we have modified the configuration of the Condor worker node to support the Condor VM universe. Next I will describe a barebones FILETRANSFER_PLUGINS executable. I will use bash for scripting, and the plugin will reside at /usr/local/bin/vm-nfs-plugin on each of the worker nodes.

#!/bin/bash
#file: /usr/local/bin/vm-nfs-plugin
#----------------------------------------
# Plugin Essential
if [ "$1" = "-classad" ]
then
   echo "PluginVersion = \"0.1\""
   echo "PluginType = \"FileTransfer\""
   echo "SupportedMethods = \"nfs\""
   exit 0
fi

#----------------------------------------
# Variable definitions
# transferInputstr_format='nfs:<abs path to (nfs hosted) input file>:<basename of vminstance file>'
WHICHQEMUIMG='/usr/bin/qemu-img'
initdir=$PWD
transferInputstr=$1
#-------------------------------------------
# Split the first argument to an array
IFS=':' read -ra transferInputarray <<< "$transferInputstr"
#-------------------------------------------
#create the vm instance as a copy-on-write image backed by the original template
$WHICHQEMUIMG create -b ${transferInputarray[1]} -f qcow2 ${initdir}/${transferInputarray[2]}
exit 0

Overall, the idea behind the above script is to create a qcow2-formatted VM instance file in the Condor-allocated execute folder. The details of the code blocks above are as follows:

The “# Plugin Essential” part of the code is a requirement for a Condor file transfer plug-in, so that a plug-in can be registered appropriately to handle file transfers based on the methods (protocols) it supports. The condor_starter daemon invokes each plug-in with the command line argument ‘-classad’ to identify the protocols that the plug-in supports, and expects the plug-in to respond with an output of three ClassAd attributes. The first two are fixed: PluginVersion = "0.1" and PluginType = "FileTransfer". The third is the ClassAd attribute ‘SupportedMethods’, a string value containing a comma-separated list of the protocols that the plug-in handles. Thus, in the script above, SupportedMethods = "nfs" identifies that the plug-in vm-nfs-plugin supports a user-defined protocol ‘nfs’. Accordingly, the ‘nfs’ string will be matched against the protocol specification given within a URL in the transfer_input_files command of a Condor job description file.

For a file transfer invocation, a plug-in is invoked with two arguments: the first is the URL specified in the job description file, and the second is the absolute path identifying where to place the transferred file. The plug-in is expected to transfer the file and exit with a status of 0 when the transfer is successful. A non-zero status must be returned when the transfer is unsuccessful; in that case the job is placed on hold, the job ClassAd attribute HoldReason is set with a message, and HoldReasonSubCode is set to the exit status of the plug-in.

In the bash code above I am only using the first argument received by the plugin. Further, the value of transfer_input_files follows the format commented in the script variable transferInputstr_format, i.e. 'nfs:<abs path to (nfs hosted) input file>:<basename of vminstance file>'. After splitting the first argument, the plug-in creates a qcow2 image with a backing file based on the original template.
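A quick way to exercise the plugin by hand is to mimic what condor_starter will do; the sketch below assumes a hypothetical template path on the NFS share:

cd /tmp
/usr/local/bin/vm-nfs-plugin -classad
/usr/local/bin/vm-nfs-plugin 'nfs:/mnt/nfs/vmtemplates/centos56.img:vmimage.img'
#a qcow2 file backed by the template should now exist in the current directory
/usr/bin/qemu-img info /tmp/vmimage.img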

Once we send a reconfig to the worker nodes using condor_reconfig, or restart the Condor service (service condor restart) on them, the plug-in is ready to be used; an example submit file is shown below.

Example Job Description

#Condor job description file
universe=vm
vm_type=kvm
executable=agurutest_vm
vm_networking=true
vm_no_output_vm=true
vm_memory=1536
#Point to the nfs location that will be available from worker node
transfer_input_files=nfs://<path to the vm image>:vmimage.img
vm_disk="vmimage.img:hda:rw"
requirements= (TARGET.FileSystemDomain =!= FALSE) && ( TARGET.VM_Type == "kvm" ) && ( TARGET.VM_AvailNum > 0 ) && ( VM_Memory >= 0 ) 
log=test.log
queue 1

This submit file should invoke the vm-nfs-plugin, and a VM instance should start on a worker node. You can inspect the VM from a shell on the worker node using the virsh utility.
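For example (the worker host name here is hypothetical):

ssh root@worker01
virsh list --all
#the job's VM should show up as a running domain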

That is all for this post. In Part 3, the last part of this series, I will write about using the file transfer plugin with the Storage Resource Manager (SRM).

Thursday, September 8, 2011

Per-Batch Job Network Statistics

Introduction

The OSG takes a fairly abstract definition of a cloud:

A cloud is a service that provisions resources on-demand for a marginal cost

The two important pieces of this definition are "resource provisioning" and "marginal cost".  The most common cloud instance you'll run into is Amazon EC2, which provisions VMs; depending on the size of VM, the marginal cost is between $0.03 and $0.80 an hour.

The EC2 charge model is actually more complicated than just VMs-per-hour.  There are additional charges for storage and network use.  In controlled experiments last year, CMS determined the largest cost of using EC2 was not the CPU time, but the network usage.

This exposed a glaring hole in OSG's current accounting: we only record wall and CPU time.  For all the other metrics - which can't be estimated accurately by looking at wall time - we are blind.

Long story short - if OSG ever wants to provide a cloud service using our batch systems, we need better accounting.

Hence, we are running a technology investigation to bring batch system accounting up to par with EC2's: https://jira.opensciencegrid.org/browse/TECHNOLOGY-2

Our current target is to provide a proof-of-concept using Condor.  With Condor 7.7.0's cgroup integration, the CPU/memory usage is very accurate, but network accounting for vanilla jobs is missing.  Network accounting is the topic for this post; we have the following goals:

  • The accounting should be done for all processes spawned during the batch job.
  • All network traffic should be included.
  • Separately account LAN traffic from WAN traffic (in EC2, these have different costs).
The Woes of Linux Network Accounting

The state of Linux network accounting, well, sucks (for our purposes!).  Here are a few ways to tackle it, and why each of them won't work:

  • Counting packets through an interface: If you assume that there is only one job per host, you can count the packets that go through a network interface.  This is a big, currently unlikely, assumption.
  • Per-process accounting: There exists a kernel patch floating around on the internet that adds per-process in/out statistics.  However, other than polling frequently, we have no mechanism to account for short-lived processes.  Besides, asking folks to run custom kernels is a good way to get ignored.
  • cgroups: There is a net controller in cgroups.  This marks packets in such a way that they can be manipulated by the tc utility.  tc controls the layer of buffering before packets are transferred to the network card and can do accounting.  Unfortunately:
    • In RHEL6, there's no way to persist tc rules.
    • This only accounts for outgoing packets; incoming packets do not pass through.
    • We cannot distinguish between local network traffic and off-campus network traffic.  This can actually be overcome with a technique similar in difficulty to Berkeley Packet Filters (BPF), but would be difficult.
  • ptrace or dynamic loader techniques: There exist libraries (exemplified by parrot) that provide a mechanism for intercepting calls.  We could instrument this.  However, this path is notoriously buggy and difficult to maintain: it would require a lot of code, and would not work for statically-compiled processes.
The most full-featured network accounting is in the routing code controlled by iptables.  In particular, this can account for incoming and outgoing traffic, plus differentiate between on-campus and off-campus traffic.

We're going to tackle the problem using iptables; the trick is going to be distinguishing all the traffic from a single batch job.  As in the previous series on managing batch system processes, we are going to borrow heavily from techniques used in Linux containers.


Per-Batch Job Network Statistics

To get perfect per-batch-job network statistics that differentiate between local and remote traffic, we will combine iptables, NAT, virtual ethernet devices, and network namespaces.  It will be something of a tour-de-force of Linux kernel networking - and currently very manual.  Automation is still forthcoming.

This recipe is a synthesis of ideas presented in several other pages on Linux networking.

We'll be thinking of the batch job as a "semi-container": it will get its own network device like a container, but have more visibility to the OS than in a container.  To follow this recipe, you'll need RHEL6 or later.


First, we'll create a pair of ethernet devices and set up NAT-based routing between them and the rest of the OS.  We will assume eth0 is the outgoing network device and that the IPs 192.168.0.1 and 192.168.0.2 are currently not routed in the network.

  1. Enable IP forwarding:
    echo 1 > /proc/sys/net/ipv4/ip_forward
  2. Create a veth ethernet device pair:
    ip link add type veth
    This will create two devices, veth0 and veth1, that act similar to a Unix pipe: bytes sent to veth1 will be received by veth0 (and vice versa).
  3. Assign IPs to the new veth devices; we will use 192.168.0.1 and 192.168.0.2:
    ifconfig veth0 192.168.0.1/24 up
    ifconfig veth1 192.168.0.2/24 up
  4. Download and compile ns_exec.c; this is a handy utility developed by IBM that allows us to create processes in new namespaces.  Compilation can be done like this:
    gcc -o ns_exec ns_exec.c
    This requires a RHEL6 kernel and the kernel headers.
  5. In a separate window, launch a new shell in a new network and mount namespace:
    ./ns_exec -nm -- /bin/bash
    We'll refer to this as shell 2 and our original window as shell 1.
  6. Use ps to determine the pid of shell 2.  In shell 1, execute:
    ip link set veth1 netns $PID_OF_SHELL_2
    In shell 2, you should be able to run ifconfig and see veth1.
  7. In shell 2, re-mount the /sys filesystem and enable the loopback device:
    mount -t sysfs none /sys
    ifconfig lo up
At this point, we have a "batch job" (shell 2) with its own dedicated networking device.  All traffic generated by this process - or its children - must pass through here.  Traffic generated in shell 2 will go into veth1 and out veth0.  However, we haven't hooked up the routing for veth0, so packets currently stop there; fairly useless.
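A quick sanity check of the plumbing so far, run from shell 2:

ip addr show          #only lo and veth1 should be listed
ping -c 1 192.168.0.1 #the veth0 end in the root namespace should answer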

Next, we create a NAT between veth0 and eth0.  This is a point of convergence - alternatively, we could bridge the networks at layer 2 or layer 3 and provide the job with its own public IP.  I'll leave that as an exercise for the reader.  For the NAT, I will assume that 129.93.0.0/16 is the on-campus network and everything else is off-campus.  Everything will be done in shell 1:
  1. Verify that any firewall won't be blocking NAT packets.  If you don't know how to do that, turn off the firewall with:
    iptables -F
    If you want a firewall, but don't know how iptables works, then you probably want to spend a few hours learning first.
  2. Enable the packet mangling for NAT:
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
  3. Forward packets from veth0 to eth0, using separate rules for on/off campus:
    iptables -A FORWARD -i veth0 -o eth0 --dst 129.93.0.0/16 -j ACCEPT
    iptables -A FORWARD -i veth0 -o eth0 ! --dst 129.93.0.0/16 -j ACCEPT
  4. Forward TCP connections from eth0 to veth0 using separate rules:
    iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED --src 129.93.0.0/16 -j ACCEPT
    iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED ! --src 129.93.0.0/16 -j ACCEPT
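One detail that is easy to miss: shell 2 may also need a default route through the veth0 end before its off-subnet traffic will flow. In shell 2 (IPs as above):

ip route add default via 192.168.0.1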
At this point, you can switch back to shell 2 and verify the network is working.  iptables will automatically do accounting; you just need to pass the right command line flags to get it printed:
iptables -L -n -v -x
If you look at the network accounting reference, they show how to separate all the accounting rules into a separate chain.  This allows you to, for example, reset counters for only the traffic accounting.  On my example host, the output looks like this:
Chain INPUT (policy ACCEPT 4 packets, 524 bytes)
    pkts      bytes target     prot opt in     out     source               destination        

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination        
      30     1570 ACCEPT     all  --  veth0  eth0    0.0.0.0/0            129.93.0.0/16      
      18     1025 ACCEPT     all  --  veth0  eth0    0.0.0.0/0           !129.93.0.0/16      
      28    26759 ACCEPT     all  --  eth0   veth0   129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED
      17    10573 ACCEPT     all  --  eth0   veth0  !129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED

Chain OUTPUT (policy ACCEPT 4 packets, 276 bytes)
    pkts      bytes target     prot opt in     out     source               destination        

As you can see, my "job" has downloaded about 26KB from on-campus and 10KB from off-campus.
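For reference, a sketch of that separate-chain idea (the chain name here is my own invention):

iptables -N job_acct
#send all forwarded packets through the accounting chain first
iptables -I FORWARD -j job_acct
iptables -A job_acct --dst 129.93.0.0/16 -j RETURN    #counts on-campus bytes
iptables -A job_acct ! --dst 129.93.0.0/16 -j RETURN  #counts off-campus bytes
#later, zero only the accounting counters:
iptables -Z job_acct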

Voilà!  Network accounting appropriate for a batch system!

Friday, August 26, 2011

Creating a VM for OpenStack

Intro

Here at HCC, we have a few VM-based projects going.  One is the Condor-based VM launching that Ashu referenced in his previous posting.  That project takes an existing capability (a Condor batch system hooked to the grid) and extends it: instead of launching processes, one can launch an entire VM.

One of our other employees, Josh, has been working from the other direction: taking a common "cloud platform", OpenStack, and seeing if it can be adapted to our high-throughput needs.  The OpenStack work is in its beginning phases, but bits and pieces are starting to become functional.

Last night, I tried out the install for the first time.  One of the initial tasks I wanted to accomplish was to create a custom VM.  A lot of the OpenStack documentation is fairly Ubuntu-specific, so I've taken their pages and adapted them for installing from a CentOS 5.6 machine.  Unfortunately, I didn't take any nice screen shots like Ashu did, but I hope this will be useful to others.

Long term, we plan to open OpenStack up to select OSG VOs for testing. While we are still in the "tear it down and rebuild once a week" mode, it's just been opened up to select HCC users.

So, without further ado, I present...


Creating a new Fedora image using HCC's OpenStack

These notes are based on the upstream openstack documents here:

http://docs.openstack.org/trunk/openstack-compute/admin/content/creating-a-linux-image.html

Prerequisites

It all starts with an account.

For local users, contact hcc-support to get your access credentials.  They will come in a zipfile.  Download the zipfile into your home directory and unpack it.  Among other things, there will be a novarc file.  Source this:

source novarc

This will set up environment variables in your shell pointing to your login credentials. Do not share these with other people! You will need to do this each time you open a new shell.

To create the image, you will need root access on a development machine with KVM installed.  I used a CentOS 5.6 machine and did:

yum groupinstall kvm

to get the various necessary KVM packages.

First, create a new raw image file:

qemu-img create -f raw /tmp/server.img 5G

This will be the block device that is presented to your virtual machine; make it as large as necessary. Our current hardware is pretty space-limited: smaller is encouraged. Next, download the Fedora boot ISO:

curl http://serverbeach1.fedoraproject.org/pub/alt/bfo/bfo.iso > /tmp/bfo.iso

This is a small, 670KB ISO file that contains just enough information to bootstrap the Anaconda installer.  Next, we'll boot it as a virtual machine on your local system.

sudo /usr/libexec/qemu-kvm -m 2048 -cdrom /tmp/bfo.iso -drive file=/tmp/server.img -boot d -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize

This will create a simple virtual machine (2 cores, 2GB RAM) with /tmp/server.img as a drive, and boot the machine from /tmp/bfo.iso.  It will also allow you to connect to the VM via a VNC viewer.

If you are physically on the host machine, you can use a VNC viewer for screen ":0".  If you are logged in remotely (I log in from my Mac), you'll want to port-forward:

ssh -L 5900:localhost:5900 username@remotemachine.example.com

From your laptop, connect to localhost:0 with a VNC viewer.  Note that the most common VNC viewers on the Mac (the built-in Remote Viewer and Chicken of the VNC) don't work with KVM.  I found that "JollyFastVNC" works, but costs $5 from the App Store.

Once logged in, select the version of Fedora you'd like to install, and "click next" until the installation is done.  Fedora 15 is sure nice :)

Fedora will want to reboot the machine, but the reboot will fail because KVM is set to only boot from the CD.  So, once it tries to reboot, kill KVM and start it again with the following arguments:

sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/server.img -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize

Again, connect via VNC, and do any post-install customization.  Start by updating and turning on SSH:

yum update
yum install openssh-server
chkconfig sshd on

You will need to tweak /etc/fstab to make it suitable for a cloud instance.  Nova-compute may resize the disk at instance launch based on the instance type chosen, which can make the UUID of the disk invalid.  Further, we will remove the LVM setup and just have the root partition present (no swap, no /boot).

Edit /etc/fstab.  Change the following three lines:

/dev/mapper/VolGroup-lv_root /                       ext4    defaults        1 1
UUID=0abae194-64c8-4d13-a4c0-6284d9dcd7b4 /boot                   ext4    defaults        1 2
/dev/mapper/VolGroup-lv_swap swap                    swap    defaults        0 0

to just one line:

LABEL=uec-rootfs              /          ext4           defaults     0    0

Since Fedora does not ship with an init script for OpenStack, we will do a nasty hack to pull the correct SSH key at boot. Edit the /etc/rc.local file and add the following lines before the line "touch /var/lock/subsys/local":

depmod -a
modprobe acpiphp

# simple attempt to get the user ssh key using the meta-data service
mkdir -p /root/.ssh
echo >> /root/.ssh/authorized_keys
curl -m 10 -s http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key | grep 'ssh-rsa' >> /root/.ssh/authorized_keys
echo "AUTHORIZED_KEYS:"
echo "************************"
cat /root/.ssh/authorized_keys
echo "************************"

Once you are finished customizing, go ahead and power off:

poweroff

Converting to an acceptable OpenStack format

The image that needs to be uploaded to OpenStack must be an ext4 filesystem image; we currently have a raw block device image.  We will extract this filesystem by running a few commands on the host machine.  First, we need to find out the starting sector of the partition. Run:

fdisk -ul /tmp/server.img

You should see an output like this (the error messages are harmless):

last_lba(): I don't know how to handle files with mode 81a4
You must set cylinders.
You can do this from the extra functions menu.

Disk /dev/loop0: 5368 MB, 5368709120 bytes
255 heads, 63 sectors/track, 652 cylinders, total 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes

      Device Boot      Start         End      Blocks   Id  System
/dev/loop0p1   *        2048     1026047      512000   83  Linux
Partition 1 does not end on cylinder boundary.
/dev/loop0p2         1026048    10485759     4729856   8e  Linux LVM
Partition 2 does not end on cylinder boundary.

Note the following commands assume the units are 512 bytes.  You will need the start and end numbers for the "Linux LVM" partition; in this case, they are 1026048 and 10485759.

Copy the entire partition to a new file:

dd if=/tmp/server.img of=/tmp/server.lvm.img skip=1026048 count=$((10485759-1026048+1)) bs=512

For "skip" and "count", use the start and end sectors you copied from the fdisk output; the count is end minus start, plus one, since both endpoint sectors are included. Now we have our LVM image; we'll need to activate it.  First, attach the LVM image to a loopback device and look for the volume group name:

[bbockelm@localhost ~]$ sudo /sbin/losetup /dev/loop0 /tmp/server.lvm.img
[bbockelm@localhost ~]$ sudo /sbin/pvscan
  PV /dev/sdb1    VG vg_home     lvm2 [7.20 TB / 0    free]
  PV /dev/sda2    VG vg_system   lvm2 [73.88 GB / 0    free]
  PV /dev/loop0   VG VolGroup    lvm2 [4.50 GB / 0    free]
  Total: 3 [1.28 TB] / in use: 3 [1.28 TB] / in no VG: 0 [0   ]

Note the third listing is for our loopback device (/dev/loop0) and a volume group named, simply, "VolGroup".  We'll want to activate that:

[bbockelm@localhost ~]$ sudo /sbin/vgchange -ay VolGroup
  2 logical volume(s) in volume group "VolGroup" now active

We can now see the Fedora root file system in /dev/VolGroup/lv_root.  We use dd to make a copy of this disk:

sudo dd if=/dev/VolGroup/lv_root of=/tmp/serverfinal.img

I get the following output:

[bbockelm@localhost ~]$ sudo dd if=/dev/VolGroup/lv_root of=/tmp/serverfinal.img
3145728+0 records in
3145728+0 records out
1610612736 bytes (1.6 GB) copied, 14.5444 seconds, 111 MB/s

It's time to release all our devices.  Start by deactivating the LVM:

[bbockelm@localhost ~]$ sudo /sbin/vgchange -an VolGroup
  0 logical volume(s) in volume group "VolGroup" now active

Then, unmount our loopback device:

[bbockelm@localhost ~]$ sudo /sbin/losetup -d /dev/loop0

We will do one last tweak: change the label on our filesystem image to "uec-rootfs":

sudo /sbin/tune2fs -L uec-rootfs /tmp/serverfinal.img

*Note* that your filesystem image is ext4; if your host is RHEL5.x (this is my case!), your version of tune2fs will not be able to complete this operation.  In this case, you will need to restart your VM in KVM with the newly-extracted serverfinal.img as a second hard drive.  I did the following KVM invocation:

sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/server.img -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize -drive file=/tmp/serverfinal.img

The second drive shows up as /dev/sdb; go ahead and re-execute tune2fs from within the VM:

[root@localhost ~]# tune2fs -L uec-rootfs /dev/sdb

Extract Kernel and Initrd for OpenStack

Fedora creates a small boot partition separate from the LVM we extracted previously.  We'll need to mount it, and copy out the kernel and initrd.  First, mount the loopback device and map the partitions.

[bbockelm@localhost ~]$ sudo /sbin/losetup -f /tmp/server.img
[bbockelm@localhost ~]$ sudo /sbin/kpartx -a /dev/loop0

The boot partition should now be available at /dev/mapper/loop0p1.  Mount this:

[bbockelm@localhost ~]$ sudo mkdir  /tmp/server_image/
[bbockelm@localhost ~]$ sudo mount /dev/mapper/loop0p1  /tmp/server_image/

Now, copy out the kernel and initrd:

[bbockelm@localhost ~]$ cp /tmp/server_image/vmlinuz-2.6.40.3-0.fc15.x86_64 ~
[bbockelm@localhost ~]$ cp /tmp/server_image/initramfs-2.6.40.3-0.fc15.x86_64.img ~

Unmount and unmap:

[bbockelm@localhost ~]$ sudo umount /tmp/server_image
[bbockelm@localhost ~]$ sudo /sbin/kpartx -d /dev/loop0
[bbockelm@localhost ~]$ sudo /sbin/losetup -d /dev/loop0

Upload into OpenStack

We need to bundle, then upload the kernel, initrd, and finally the image.  First, the kernel:

[bbockelm@localhost ~]$ euca-bundle-image -i ~/vmlinuz-2.6.40.3-0.fc15.x86_64 --kernel true
Checking image
Encrypting image
Splitting image...
Part: vmlinuz-2.6.40.3-0.fc15.x86_64.part.00
Generating manifest /tmp/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
[bbockelm@localhost ~]$ euca-upload-bundle -b testbucket -m /tmp/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
Checking bucket: testbucket
Uploading manifest file
Uploading part: vmlinuz-2.6.40.3-0.fc15.x86_64.part.00
Uploaded image as testbucket/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
[bbockelm@localhost ~]$ euca-register testbucket/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml
IMAGE	aki-0000000a

Write down the kernel ID; it is aki-0000000a above. Then, the initrd:

euca-bundle-image -i ~/initramfs-2.6.40.3-0.fc15.x86_64.img --ramdisk true
euca-upload-bundle -b testbucket -m /tmp/initramfs-2.6.40.3-0.fc15.x86_64.img.manifest.xml
euca-register testbucket/initramfs-2.6.40.3-0.fc15.x86_64.img.manifest.xml

My initrd's ID was ari-0000000b. Finally, the disk image itself:

euca-bundle-image --kernel aki-0000000a --ramdisk ari-0000000b -i /tmp/serverfinal.img -r x86_64

This will save the image into /tmp, named "serverfinal.img.manifest.xml". I didn't particularly care for the name, so I changed it to "fedora-15.img.manifest.xml".
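The rename itself is just a move of the manifest file (the bundle parts it references keep their own names):

mv /tmp/serverfinal.img.manifest.xml /tmp/fedora-15.img.manifest.xml

Now, upload: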

euca-upload-bundle -b testbucket -m /tmp/fedora-15.img.manifest.xml
euca-register testbucket/fedora-15.img.manifest.xml

Congratulations! You now have a brand-new Fedora-15 image ready to use. Fire up HybridFox and see if you were successful.

Thursday, August 18, 2011

KVM and Condor (Part 1): Creating the virtual machine.

My next topic of discussion, which will be a two-part blog, is launching a Virtual Machine (VM) in a Condor environment.  In the first of these two posts I will share the steps that I took to create a VM that I will later launch as a job in Condor.

I will be using the Kernel-based Virtual Machine (KVM) implementation for Linux guests.  KVM is a full virtualization framework which can run multiple unmodified guests, including various flavors of Microsoft Windows, Linux, and other UNIX-family systems. To see the types of guest operating systems and platforms that KVM supports, you can look at http://www.linux-kvm.org/page/Guest_Support_Status

Let’s get started. For this post, the host system on which I am working is running CentOS 6.0 with Linux 2.6.32 on an x86_64 platform.  I will be creating a CentOS 5.6 image for the VM guest.  As the first step, I will get my host system ready with the KVM tools and other dependencies. For this I require a package called kvm, which includes the KVM kernel module. In addition to the kvm package I will be using three tools (viz. virt-install, virsh, and virt-viewer) from a toolkit called libvirt. Libvirt (http://libvirt.org/) is a hypervisor-independent API that is able to interact with the virtualization capabilities of various operating systems. The commands below show how to use yum to install kvm and the libvirt-related packages:

yum install kvm

yum install virt-manager libvirt libvirt-python python-virtinst libvirt-client

I am now ready to create the VM by using the following command:

virt-install \
--name=vm56-25GB \
--disk path=/home/aguru/myvms/vm5.6-25GB.img,sparse=true,size=25 \
--ram=2048 \
--location=http://mirror.unl.edu/centos/5.6/os/x86_64/ \
--os-type=linux \
--vnc



In the above snippet, 'virt-install' is a libvirt command line tool for provisioning new virtual machines. The options I have used are explained below:
--name is the name of the new machine that I am creating
--disk specifies the absolute path of the virtual machine image (file) that will be created. The ‘sparse’ option means that the host system does not have to allocate all the space up-front, and ‘size’ gives the size of the VM's hard disk drive in GB
--ram is the RAM of the guest in MB
--location provides the location of a network install, where the OS install files for the guest are located
--os-type specifies the type of guest operating system
--vnc sets up a virtual console in the guest and exports it as a VNC server on the host

Unless there are missing dependencies or tools that somehow did not get installed correctly, your install should start with a new VNC window popping up on your display. A few screen captures of what you may see are shown below.

** Just a quick note - to release the mouse cursor from the VNC window you can use Ctrl-Alt.

(Installer screenshots 1 through 14 appeared here, ending with the final installation screen.)

On the final screen of installation you can click the 'Reboot' button from the VM window to restart the guest VM.

A few basic commands to list, start, and stop a VM

virsh list --all



The output of virsh list --all shows the defined VMs and their current state. For example, a typical output may look like:


Id Name                 State
----------------------------------
- vm56-15KSGB          shut off
- vm56-25GB            shut off



In order to start a VM from the shut off state, issue a virsh start command. Note below that virsh list --all now shows an Id and the running state of the VM (vm56-15KSGB):

virsh start vm56-15KSGB

virsh list --all
Id Name                 State
----------------------------------
1 vm56-15KSGB          running
- vm56-25GB            shut off 
 

To launch a VNC window displaying the console of a running VM, you can use virt-viewer, e.g.

virt-viewer  1


And finally, to shut down a running VM use virsh shutdown, or force it off with virsh destroy, e.g.
 
virsh shutdown 1
or
virsh destroy 1


Both virt-viewer and virsh shutdown take the Id of the running VM as an argument.

What if I have a Kickstart file for the VM I want to create?

In case you have a Kickstart file that you would like to use for creating the VM, you may use the following command:

virt-install \
--name=vm56-15KSGB \
--disk path=/home/aguru/myvms/vm56-15KSGB.img,sparse=true,size=15 \
--ram=2048 \
--location=http://newman.ultralight.org/os/centos/5.5/x86_64 \
--os-type=linux \
--vnc \
-x "ks=http://httpdserver.hosting.kickstart/pathto.kickstart.file"

The only addition in this virt-install command, compared to its previous use in this post, is the extra flag '-x'. The value passed with the -x flag points to the web location of the kickstart file.

That is all for this post. In the next post I will talk about taking this newly created image and launching it in the Condor VM universe.

Tuesday, July 12, 2011

Squid Caching in OSG Environment

A few months back I assisted a research group from the University of Nebraska Medical Center (UNMC) in deploying a search for mass spectrometry-based proteomics analysis. This search was performed using a program called the Open Mass Spectrometry Search Algorithm (OMSSA) on the Open Science Grid (OSG) via a GlideinWMS Frontend. In this post I will talk about the motivation for and use of HTTP file transfer, along with squid caching, for the input data and executable files of jobs deployed over the OSG. I will also show a basic example explaining the use of Squid in the OSG environment.

While working with the UNMC research group, and after looking at the OMSSA specifications and documentation, we identified the following characteristics of the computation and data handling requirements for the proteomics analysis:
•    A total of 45 datasets, each about 21MB.
•    22,000 comparisons/searches (short jobs) per dataset.
•    The executables along with the search libraries for the comparison sum up to a total of 83MB as a compressed archive.
Based on these requirements and a few additional tests, it was determined that the job is well adapted for OSG via GlideinWMS. It was also decided that each GlideinWMS job would contain about 172 comparisons, which works out to a total of about 5756 individual jobs (22000*45/172).

Data in the Open Science Grid has always been more difficult to handle than computation. The challenges get harder as either the number of jobs or the data size increases. There are various methods used to overcome and simplify these challenges. Table 1 below shows a rule of thumb that I generally follow to identify the best mode of data transfer for jobs in the OSG environment. Each data transfer method in Table 1 has its own advantages: Condor’s internal file transfer is a built-in method, so no extra scripting is required; SRM can handle large data stores and large transfers; and pre-staging can distribute the load of pulling down data.

Table 1. Rule of thumb for data transfer using Condor/GlideinWMS jobs in OSG

Data Size       Data Transfer Method
< 10MB          Condor's File Transfer Mechanism
10MB - 500MB    Storage Element (SE) / Storage Resource Manager (SRM) interface
> 500MB         SRM/dCache or Pre-staging

When the number of jobs is significantly large and the data transfer size reaches the higher limits of Condor's internal file transfer, we have found in past experience that HTTP file transfer works fairly well for us: it moves the load of transferring input files and executables away from the GlideinWMS Frontend server.  For the proteomics analysis project, since the compressed archive of the search library and executables (83MB) was the same across all jobs, and the input data was the same within each dataset, we decided to push our HTTP file transfer experience further by adding squid caching. The advantage of caching becomes more evident as more jobs land on compute nodes at a site with a local (site-specific) squid server, up to the limits of the squid server itself.

Every CMS and ATLAS site is required to have a squid server, whose location is available via the environment variable OSG_SQUID_LOCATION. This implies that, using a very simple wrapper script on a compute node, one can easily pull down input files and/or executables using a client tool such as wget or curl and then proceed with the actual computation. The example below shows a bash script that reads the OSG_SQUID_LOCATION environment variable on a compute node and then tries to download a file via squid; on failure, the script downloads the file directly from the source.  (Ref: https://twiki.grid.iu.edu/bin/view/Documentation/OsgHttpBasics)


#!/bin/sh
website=http://google.com/

#Section A
source $OSG_GRID/setup.sh
export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}
if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then
  export http_proxy=$OSG_SQUID_LOCATION
fi

#Section B
wget --retry-connrefused --waitretry=20 $website

#Section C 
#Check if the download worked
if [ $? -ne 0 ]
then
   unset http_proxy
   wget --retry-connrefused --waitretry=20 $website
   if [ $? -ne 0 ]
   then
      exit 1
   fi
fi

Listed below is the explanation of the above code:
  • Section A: Check if the environment variable OSG_SQUID_LOCATION is set; if so, export its value as the environment variable http_proxy, which wget uses as the squid server location
  • Section B: Download the file using wget. The flag --retry-connrefused treats a connection refused as a transient error and tries again, which helps handle short-term failures. The wait time of 20 seconds between retries is specified via --waitretry
  • Section C: If the download through the squid server fails, access the actual HTTP source after unsetting http_proxy
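For completeness, a curl-based variant of the same logic might look like the sketch below (the URL is hypothetical; curl also honors the http_proxy environment variable):

#!/bin/sh
[ "${OSG_SQUID_LOCATION:-UNAVAILABLE}" != UNAVAILABLE ] && export http_proxy=$OSG_SQUID_LOCATION
#try via squid first; on failure, fall back to the origin server
curl --retry 5 --retry-delay 20 -O http://example.com/input.tar.gz || {
   unset http_proxy
   curl --retry 5 --retry-delay 20 -O http://example.com/input.tar.gz
}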

In addition to the availability of an OSG site-specific squid server, for this type of data transfer to work one requires a reliable HTTP server that can handle download requests from sites where a squid server is unavailable. The HTTP server must also be able to handle the requests originating from the squid servers, along with any failover requests. At UNL we have set up a dedicated HTTP serving infrastructure with load-balanced failover. This is implemented using the Linux Virtual Server; its implementation details are shown in the diagram below.

(Diagram: load-balanced HTTP serving infrastructure built on Linux Virtual Server.)

Friday, July 8, 2011

Part III: Bulletproof process tracking with cgroups

Finally, it's time to provide a good solution for accomplishing process tracking in a Linux batch system.
If you recall, in Part I we surveyed common methods for process tracking and ultimately concluded that batch systems used userspace mechanisms (most of which were originally designed for shell-based process control, by the way) that were unreliable, or couldn't detect when failures occurred.  In Part II, the picture brightened: the kernel provided an event feed about process births and deaths, and informed us when messages were dropped.

In this post, we'll talk about a new feature called "cgroups", short for "control groups".  Cgroups are a mechanism in the Linux kernel for managing a set of processes and all their descendants.  They are managed through a filesystem-like interface (in the manner of /proc); the directory structure expresses the fact that they are hierarchical, and filesystem permissions can be used to restrict the set of users allowed to manipulate them.  By default, only root is allowed to manipulate control groups: unlike the process groups, process trees, and environment cookies examined before, a process typically has no ability to change its group.  Further, unlike the proc connector API, the control group is assigned synchronously by the kernel at process creation time.  Hence, fork-bombs are not an effective way to escape from the group.

While having the tracking done by the kernel is an immense improvement, the true power of cgroups becomes apparent through the use of multiple subsystems.  Different cgroup subsystems may act to control scheduler policy, allocate or limit resources, or account for usage.

For example, the memory controller can be used to limit the amount of memory used by a set of processes.  This is a huge improvement over the previous memory limit technique (rlimit), where the limit was assigned per-process.  With rlimit, you could limit a single process to 1GB, but the job would just spawn N processes of 1GB each, sidestepping your limits.  In the kernel shipped with Fedora 15, 10 controllers are active by default.  For more information, check the documentation; if you are a Redhat customer, I find the RHEL6 manual has the best cgroups documentation out there.
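As a small illustration of the per-group (rather than per-process) limit, here is a sketch; the group name is my own, and it assumes the memory controller is mounted at /cgroups/memory as on Fedora 15:

mkdir /cgroups/memory/myjob
#cap the whole group - however many processes it spawns - at 1GB
echo $((1024*1024*1024)) > /cgroups/memory/myjob/memory.limit_in_bytes
#place the job's lead process in the group; its children follow automatically
echo $JOB_PID > /cgroups/memory/myjob/tasks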

To see cgroups in action, use the systemd-cgls command found on Fedora 15.  This will print out the current hierarchy of all cgroups.  Here's what I see on my system (output truncated for display reasons):

├ condor
│ ├ 17948 /usr/sbin/condor_master -f
│ ├ 17949 condor_collector -f
│ ├ 17950 condor_negotiator -f
│ ├ 17951 condor_schedd -f
│ ├ 17952 condor_startd -f
│ ├ 17953 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 48...
│ └ 18224 condor_procd -A /var/run/condor/procd_pipe.STARTD -R 10000000 -S 60 -C 48...
├ user
│ ├ root
│ │ └ master
│ │   └ 6879 bash
│ └ bbockelm
│   ├ 1168
│   │ ├ 21426 sshd: bbockelm [priv]
│   │ ├ 21429 sshd: bbockelm@pts/3
│   │ ├ 21430 -bash
│   │ └ 21530 systemd-cgls
│   ├ 309
│   │ ├  1110 /usr/libexec/gvfsd-http --spawner :1.4 /org/gtk/gvfs/exec_spaw/0
│   │ ├  6198 gnome-terminal
│   │ ├  6202 gnome-pty-helper 
(output trimmed) 
└ system
  ├ 1 /bin/systemd --log-level info --log-target syslog-or-kmsg --system --dump...
  ├ sendmail.service
  │ ├ 8603 sendmail: accepting connections
  │ └ 8612 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
  ├ auditd.service
  │ ├ 8542 auditd
  │ ├ 8544 /sbin/audispd
  │ └ 8552 /usr/sbin/sedispatch
  ├ sshd.service
  │ └ 7572 /usr/sbin/sshd 
(output trimmed)

All of the processes in my system are in the / cgroup; all login shells are placed inside a cgroup named /user/$USERNAME; each system service (such as ssh) is located inside a cgroup named /system/$SERVICENAME; finally, there's a special one named /condor.  More on /condor later.

To see the cgroups for the current process, you can do the following:
[bbockelm@mydesktop ~]$ cat /proc/self/cgroup 
10:blkio:/
9:net_cls:/
8:freezer:/
7:devices:/
6:memory:/
5:cpuacct:/
4:cpu:/
3:ns:/
2:cpuset:/
1:name=systemd:/user/bbockelm/1168
Note that each process is not necessarily in only one cgroup. The rules are that a process can have one cgroup per mount, there are one or more controllers per mount, and a controller can only be mounted once.
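To see which controllers are mounted where on a given box, a quick check is:

grep cgroup /proc/mounts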

Each controller exposes statistics through its filesystem interface.  For example, on Fedora 15, if I want to see how much memory all of the Condor processes are using, I can do the following:

[bbockelm@rcf-bockelman ~]$ cat /cgroups/memory/condor/memory.usage_in_bytes 
34365440

But what about the batch system?
I hope our readers can see the immediate utility in having a simple mechanism for inescapable process tracking.  We examined one such mechanism before (adding a secondary GID per batch job), but it has a small drawback in that the secondary GID can be used to create permanent objects (files owned by the secondary GID) which outlive the lifetime of the batch job.

But, even in Part I of the series, we concluded that a perfect process tracking mechanism is not enough: we also need to be able to kill processes when the batch job is finished!  The cgroups developers must have come to the same conclusion, as one controller is called the freezer.  The freezer cgroup simply stops its processes from receiving CPU time from the kernel.  All processes in the cgroup are frozen - and there is no way for a process to know it is about to freeze, as they aren't informed via signals.  Hence, a process tracker can freeze the processes, send them all SIGKILL, and unfreeze them.  All processes will end immediately; none will have the ability to hide in the /proc system or spawn new children in a race condition.
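A sketch of that kill sequence from a root shell (the cgroup path is illustrative):

echo FROZEN > /cgroups/freezer/condor/job_42_0/freezer.state
for pid in $(cat /cgroups/freezer/condor/job_42_0/tasks); do kill -9 "$pid"; done
#the SIGKILLs are delivered once the group thaws
echo THAWED > /cgroups/freezer/condor/job_42_0/freezer.state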

If you look at the first process tree posted, there is a cgroup called "condor".  As I presented at Condor Week 2011, condor is now integrated with cgroups.  It can be started in a cgroup the sysadmin specifies (such as /condor), and it will create a unique cgroup for each job (/cgroup/job_$CLUSTERID_$PROC_ID).  It uses whatever controllers are active on the system to try and track memory consumption, CPU time, and block I/O.  When the job ends or is killed, the freezer controller is used to clean up any processes.
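In configuration terms this amounts to a single knob pointing Condor at the base cgroup; the line below is a sketch, and the exact knob name should be checked against the manual for your version:

BASE_CGROUP = /condor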

Conclusions
As disparate scientific clusters have become increasingly linked through the use of grids, improved process tracking has become more important.  Many sites have users from across the nation; it's no longer possible for a sysadmin to be good friends with each user.  Some users have jobs of questionable quality; some have virus-ridden laptops.

In the end, traditional process tracking in batch systems is not really ready for modern users.  Most modern batch systems no longer rely solely on the original Unix grouping mechanisms, but they will still fall to malicious users.  The problem is not solvable from user space alone.

Luckily, with the proc connector API (for any Linux 2.6 kernel) and cgroups (for recent kernels), we can greatly improve the state of the art.  The set of folks contributing to the Linux kernel is broad, but I understand many of the contributions to cgroups have come from the OpenVZ folks: thanks, guys!

As I've been exploring this subject, I have been implementing cgroup usage in Condor: I think it's a great new feature.  It will be released with Condor 7.7.0, due in a few days.  There's no reason other batch systems can't also adopt cgroups for process tracking: I hope they spread widely in the future!

Friday, June 24, 2011

Part II: Keeping a mindful eye on your users with ProcPolice.

In Part I of this series, we talked about the various mechanisms a batch system uses to track your job's processes, and concluded the state of the art isn't particularly impressive.  The only way to go is up; this post discusses an improved technique for process tracking in Linux.  It was motivated by this blog post from the author of upstart; if you feel inspired and would like to read some code, that post is highly recommended reading.

The previous post went from bad to dire: most batch systems use methods that are easily defeated by the job changing its runtime environment (altering the process group, reparenting to init, or changing the environment). Even when using a reliable tracking method, killing an arbitrary set of processes is not possible.

To top it off - when process tracking or killing goes awry, we have no reliable means to detect when our methods fail.

There's a small, relatively unknown corner of the Linux kernel that can help us out: the proc connector. A privileged process can connect a socket to the kernel, and receive a stream of messages about processes on the system. Any time one of the following system calls happens:
  • fork/clone
  • exec
  • exit
  • setuid
  • setgid
  • setsid
for a thread or a process (all the events are documented in linux/cn_proc.h in the kernel's sources), the socket receives a message containing all the relevant event details.  By tracking only the fork and exit events, one can build a process tree in memory, starting with the batch system worker process.

Because it is based on events from the kernel, not periodic polling of /proc from user space, this is a far more reliable method for tracking processes.  With a little help from the kernel, the picture is already brighter!

The drawback here is that, while a message is being processed by user-space, further messages to the socket are buffered in memory.  When the buffer is full, the kernel drops any further messages: the tracker will lose possibly important events.  The event stream is asynchronous: the fork or exit occurs regardless of whether you process the associated message.  Unless the tracking code is particularly slow, the only likely case where the buffer overruns is exactly the case we care about: someone launching a fork bomb to escape the system.

If you have too many messages, the first step is to receive fewer messages.  One can hand the kernel a small program in a special assembly language that pre-filters messages: a message that isn't put into the queue can't overflow it!  Writing these filters is a fun academic exercise, but not useful here: when a fork bomb occurs, the messages the process needs to receive are precisely the ones overflowing the buffer!

So while never failing is preferable, detecting when we have failed is acceptable: when the buffer is full, the next attempt to read a message from the socket will return ENOBUFS to indicate the buffer has overflowed.  Actions can be taken: for most batch systems, a nasty email to the user and sysadmin might be sufficient.  If you work for the NSA, perhaps the appropriate response is to power down the worker node and send out black helicopters.

I've taken the approach outlined here and turned it into a small package called "ProcPolice".  It consists of a simple daemon which listens to the stream of events and adds each process to an in-memory tree if it can be traced back to a batch system job.  ProcPolice will detect when a process reparents to init and, if it was launched from the batch system and is non-root, it can log the event or kill the newly-daemonized process.  In testing, it is able to stop simple fork-bombs and detect more sophisticated ones.

As ProcPolice runs as a separate daemon that watches the batch system and intervenes only on daemonized processes, it can be used immediately with any batch system (Condor and PBS have been tested).  ProcPolice is available in source code form from svn://t2.unl.edu/brian/proc_police, or as a RHEL5-compatible RPM.
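Grabbing the source is standard svn usage:

svn co svn://t2.unl.edu/brian/proc_police proc_police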

ProcPolice was invented with a few specific requirements in mind:
  1. Prevent batch system-based processes from outliving the lifetime of the batch system job without changing the runtime of the job itself.
  2. Do this without support in the batch system itself.
  3. Detect when failures occur.
  4. Support RHEL5 (the OS used by the LHC for the next few years).
It turns out the last requirement is perhaps the most stringent one; newer kernels have a specific feature for tracking and controlling arbitrary sets of processes.  This is the topic of the next part of this series.