OSG Technology Area Rumblings: September 2011

Introduction

The OSG takes a fairly abstract definition of a cloud:

A cloud is a service that provisions resources on-demand for a marginal cost

The two important pieces of this definition are "resource provisioning" and "marginal cost". The most common cloud instance you'll run into is Amazon EC2, which provisions VMs; depending on the size of VM, the marginal cost is between $0.03 and $0.80 an hour.

The EC2 charge model is actually more complicated than just VMs-per-hour. There are additional charges for storage and network use. In controlled experiments last year, CMS determined the largest cost of using EC2 was not the CPU time, but the network usage.

This showed a glaring hole in OSG's current accounting: we only record wall and CPU time. For the other metrics - which can't be estimated accurately by looking at wall time - we are blind.

Long story short - if OSG ever wants to provide a cloud service using our batch systems, we need better accounting.

Hence, we are running a technology investigation to bring batch system accounting up to par with EC2's: https://jira.opensciencegrid.org/browse/TECHNOLOGY-2

Our current target is to provide a proof-of-concept using Condor. With Condor 7.7.0's cgroup integration, the CPU/memory usage is very accurate, but network accounting for vanilla jobs is missing. Network accounting is the topic for this post; we have the following goals:

The accounting should be done for all processes spawned during the batch job.
All network traffic should be included.
Separately account LAN traffic from WAN traffic (in EC2, these have different costs).

The Woes of Linux Network Accounting

The state of Linux network accounting, well, sucks (for our purposes!). Here are a few ways to tackle it, and why each of them won't work:

Counting packets through an interface: If you assume that there is only one job per host, you can count the packets that go through a network interface. This is a big, currently unlikely, assumption.
Per-process accounting: There exists a kernel patch floating around on the internet that adds per-process in/out statistics. However, other than polling frequently, we have no mechanism to account for short-lived processes. Besides, asking folks to run custom kernels is a good way to get ignored.
cgroups: There is a net controller in cgroups. This marks packets in such a way that they can be manipulated by the tc utility. tc controls the layer of buffering before packets are transferred to the network card and can do accounting. Unfortunately:

In RHEL6, there's no way to persist tc rules.
This only accounts for outgoing packets; incoming packets do not pass through.
We cannot distinguish between local network traffic and off-campus network traffic. This could be overcome with a packet-matching technique along the lines of Berkeley Packet Filters (BPF), but it would be difficult.

ptrace or dynamic loader techniques: There exist libraries (exemplified by parrot) that provide a mechanism for intercepting calls. We could instrument this. However, this path is notoriously buggy and difficult to maintain: it would require a lot of code, and would not work for statically-compiled processes.
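For reference, the first approach above (counting everything that crosses an interface) can be sketched in a few lines of shell. This is only an illustration under the one-job-per-host assumption; the function names `iface_bytes` and `iface_delta` are made up for this post, and the optional directory argument exists purely so the function can be tried against a fake tree.

```shell
#!/bin/sh
# Print "rx_bytes tx_bytes" for one interface by reading the kernel's
# per-interface counters from sysfs.
# Usage: iface_bytes IFACE [SYSFS_NET_DIR]
# SYSFS_NET_DIR defaults to /sys/class/net (hypothetical override for testing).
iface_bytes() {
    iface=$1
    base=${2:-/sys/class/net}
    stats="$base/$iface/statistics"
    if [ ! -r "$stats/rx_bytes" ] || [ ! -r "$stats/tx_bytes" ]; then
        echo "no counters for interface $iface" >&2
        return 1
    fi
    echo "$(cat "$stats/rx_bytes") $(cat "$stats/tx_bytes")"
}

# Given two "rx tx" samples, print the bytes received and sent in between.
# $1 = sample taken before the job, $2 = sample taken after the job.
iface_delta() {
    # Word-split both samples into $1..$4: rx_before tx_before rx_after tx_after
    set -- $1 $2
    echo "$(( $3 - $1 )) $(( $4 - $2 ))"
}
```

Sampling `iface_bytes eth0` before and after the job and passing both samples to `iface_delta` gives the job's traffic, but only if nothing else on the host touched the network in between, and with no LAN/WAN split.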

The most full-featured network accounting is in the routing code controlled by iptables. Particularly, this can account incoming and outgoing traffic, plus differentiate between on-campus and off-campus traffic.

We're going to tackle the problem using iptables; the trick is going to be distinguishing all the traffic from a single batch job. As in the previous series on managing batch system processes, we are going to borrow heavily from techniques used in Linux containers.

Per-Batch Job Network Statistics

To get perfect per-batch-job network statistics that differentiate between local and remote traffic, we will combine iptables, NAT, virtual ethernet devices, and network namespaces. It will be something of a tour de force of Linux kernel networking - and is currently very manual. Automation is still forthcoming.

This recipe is a synthesis of the ideas presented in the following pages:

Manually setting up networking for a container: http://lxc.sourceforge.net/index.php/about/kernel-namespaces/network/configuration/
Traffic accounting with iptables: http://www.catonmat.net/blog/traffic-accounting-with-iptables/
Using a NAT between the "container"

We'll be thinking of the batch job as a "semi-container": it will get its own network device like a container, but have more visibility to the OS than in a container. To follow this recipe, you'll need RHEL6 or later.

First, we'll create a pair of ethernet devices and set up NAT-based routing between them and the rest of the OS. We will assume eth0 is the outgoing network device and that the IPs 192.168.0.1 and 192.168.0.2 are currently not routed in the network.

Enable IP forwarding:
```
echo 1 > /proc/sys/net/ipv4/ip_forward
```
Create a veth ethernet device pair:
```
ip link add type veth
```
This will create two devices, veth0 and veth1, that act much like a Unix pipe: bytes sent to veth1 will be received by veth0 (and vice versa).
Assign IPs to the new veth devices; we will use 192.168.0.1 and 192.168.0.2:
```
ifconfig veth0 192.168.0.1/24 up
ifconfig veth1 192.168.0.2/24 up
```
Download and compile ns_exec.c; this is a handy utility developed by IBM that allows us to create processes in new namespaces. Compilation can be done like this:
```
gcc -o ns_exec ns_exec.c
```
This requires a RHEL6 kernel and the kernel headers.
In a separate window, launch a new shell in a new network and mount namespace:
```
./ns_exec -nm -- /bin/bash
```
We'll refer to this as shell 2 and our original window as shell 1.
Use ps to determine the pid of shell 2. In shell 1, execute:
```
ip link set veth1 netns $PID_OF_SHELL_2
```
In shell 2, you should be able to run ifconfig and see veth1.
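In shell 2, re-mount the /sys filesystem and enable the loopback device. A new network namespace starts with an empty routing table, and a device moved between namespaces typically arrives unconfigured, so also re-assign the address to veth1 and add a default route through veth0's address:
```
mount -t sysfs none /sys
ifconfig lo up
ifconfig veth1 192.168.0.2/24 up
route add default gw 192.168.0.1
```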

At this point, we have a "batch job" (shell 2) with its own dedicated networking device. All traffic generated by this process - or its children - must pass through here. Traffic generated in shell 2 will go into veth1 and out veth0. However, we haven't hooked up the routing for veth0, so packets currently stop there; fairly useless.
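Since the steps so far are all manual, here is a rough sketch of how the host-side part (shell 1) might be wrapped up. It is not the forthcoming automation, just an illustration: the helper names `run`, `setup_job_veth` and `move_to_job` are hypothetical, and by default the script only prints the commands instead of running them. It uses `sysctl -w` in place of the echo into /proc, and names both ends of the veth pair explicitly rather than relying on the default veth0/veth1 names.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 (the default here) commands are only
# printed; set DRY_RUN=0 and run as root to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Host-side setup: enable forwarding, create the veth pair, assign addresses.
# Usage: setup_job_veth HOST_IF JOB_IF HOST_IP JOB_IP
setup_job_veth() {
    host_if=$1; job_if=$2; host_ip=$3; job_ip=$4
    # Same effect as: echo 1 > /proc/sys/net/ipv4/ip_forward
    run sysctl -w net.ipv4.ip_forward=1
    # Name both ends explicitly instead of relying on the default names
    run ip link add "$host_if" type veth peer name "$job_if"
    run ifconfig "$host_if" "$host_ip/24" up
    run ifconfig "$job_if" "$job_ip/24" up
}

# Hand the job end of the pair to the namespace owned by process JOB_PID.
# Usage: move_to_job JOB_IF JOB_PID
move_to_job() {
    run ip link set "$1" netns "$2"
}
```

Running `setup_job_veth veth0 veth1 192.168.0.1 192.168.0.2` in dry-run mode prints the four commands for review; the namespace-side steps (re-mounting /sys, bringing up lo) still have to happen inside shell 2.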

Next, we create a NAT between veth0 and eth0. This is a point where approaches diverge - alternatively, we could bridge the networks at layer 2 or layer 3 and provide the job with its own public IP. I'll leave that as an exercise for the reader. For the NAT, I will assume that 129.93.0.0/16 is the on-campus network and everything else is off-campus. Everything will be done in shell 1:

Verify that any firewall won't be blocking NAT packets. If you don't know how to do that, turn off the firewall with:
```
iptables -F
```
If you want a firewall, but don't know how iptables works, then you probably want to spend a few hours learning first.

Enable the packet mangling for NAT:
```
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```
Forward packets from veth0 to eth0, using separate rules for on/off campus:
```
iptables -A FORWARD -i veth0 -o eth0 --dst 129.93.0.0/16 -j ACCEPT
iptables -A FORWARD -i veth0 -o eth0 ! --dst 129.93.0.0/16 -j ACCEPT
```
Forward TCP connections from eth0 to veth0 using separate rules:
```
iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED --src 129.93.0.0/16 -j ACCEPT
iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED ! --src 129.93.0.0/16 -j ACCEPT
```
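The rules above hard-code veth0, eth0 and 129.93.0.0/16. A small generator makes the pattern easier to reuse on another host; this is only a sketch, and `job_rules` is a made-up name. It prints the same five iptables commands with the interface names and campus network filled in, so you can review them before piping them to `sh` as root.

```shell
#!/bin/sh
# Print the NAT and forwarding rules for one job.
# Usage: job_rules JOB_IF OUT_IF CAMPUS_CIDR
# Nothing is executed here; the output is a list of iptables commands.
job_rules() {
    job_if=$1; out_if=$2; campus=$3
    # Rewrite the job's private source address on the way out
    echo "iptables -t nat -A POSTROUTING -o $out_if -j MASQUERADE"
    # Outgoing traffic: one rule (and counter) for on-campus, one for the rest
    echo "iptables -A FORWARD -i $job_if -o $out_if --dst $campus -j ACCEPT"
    echo "iptables -A FORWARD -i $job_if -o $out_if ! --dst $campus -j ACCEPT"
    # Incoming replies: again split by on-campus versus off-campus source
    echo "iptables -A FORWARD -i $out_if -o $job_if -m state --state RELATED,ESTABLISHED --src $campus -j ACCEPT"
    echo "iptables -A FORWARD -i $out_if -o $job_if -m state --state RELATED,ESTABLISHED ! --src $campus -j ACCEPT"
}
```

For example, `job_rules veth0 eth0 129.93.0.0/16` reproduces the commands listed above. The rules come in on-campus/off-campus pairs because every rule carries its own packet and byte counters, which is what gives us the LAN/WAN breakdown.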

At this point, you can switch back to shell 2 and verify the network is working. iptables will automatically do accounting; you just need to enable command line flags to get it printed:
```
iptables -L -n -v -x
```
The traffic accounting reference above shows how to move all the accounting rules into a separate chain. This allows you to, for example, reset counters for only the traffic accounting. On my example host, the output looks like this:

```
Chain INPUT (policy ACCEPT 4 packets, 524 bytes)
    pkts      bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination
      30     1570 ACCEPT     all  --  veth0  eth0    0.0.0.0/0            129.93.0.0/16
      18     1025 ACCEPT     all  --  veth0  eth0    0.0.0.0/0           !129.93.0.0/16
      28    26759 ACCEPT     all  --  eth0   veth0   129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED
      17    10573 ACCEPT     all  --  eth0   veth0  !129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED

Chain OUTPUT (policy ACCEPT 4 packets, 276 bytes)
    pkts      bytes target     prot opt in     out     source               destination
```

As you can see, my "job" has downloaded about 26KB from on-campus and 10KB from off-campus.
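Reading the table by eye doesn't scale to a batch system, so here is a sketch of how the four counters could be pulled out with awk. It assumes the column layout shown in the listing above (bytes in column 2, in/out interfaces in columns 6 and 7, source and destination in columns 8 and 9, negated networks printed with a leading "!"); `job_traffic` is a hypothetical helper name.

```shell
#!/bin/sh
# Summarize per-job traffic from "iptables -L FORWARD -n -v -x" output.
# Usage: iptables -L FORWARD -n -v -x | job_traffic JOB_IF CAMPUS_CIDR
job_traffic() {
    awk -v jobif="$1" -v campus="$2" '
        # Rule lines start with a numeric packet count; skip headers and blanks.
        $1 !~ /^[0-9]+$/ { next }
        # Packets entering from the job interface are outgoing traffic.
        $6 == jobif && $9 == campus         { lan_out += $2 }
        $6 == jobif && $9 == ("!" campus)   { wan_out += $2 }
        # Packets leaving toward the job interface are incoming traffic.
        $7 == jobif && $8 == campus         { lan_in  += $2 }
        $7 == jobif && $8 == ("!" campus)   { wan_in  += $2 }
        END {
            printf "lan_in=%d lan_out=%d wan_in=%d wan_out=%d\n",
                   lan_in, lan_out, wan_in, wan_out
        }'
}
```

Fed the FORWARD chain from my example host with `job_traffic veth0 129.93.0.0/16`, this prints `lan_in=26759 lan_out=1570 wan_in=10573 wan_out=1025`, which matches the numbers quoted above.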

Voilà! Network accounting appropriate for a batch system!
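One loose end: the separate accounting chain mentioned earlier. A rough sketch of what that could look like for a single job follows; the function name `acct_chain_rules` and the chain name used in the example are made up here, and the commands are only printed, not run. It relies on the fact that an iptables rule with no `-j` target does nothing except update its counters.

```shell
#!/bin/sh
# Print commands that set up a dedicated accounting chain for one job.
# Usage: acct_chain_rules CHAIN JOB_IF CAMPUS_CIDR
acct_chain_rules() {
    chain=$1; job_if=$2; campus=$3
    echo "iptables -N $chain"
    # Send every forwarded packet through the accounting chain first
    echo "iptables -I FORWARD -j $chain"
    # Rules without a -j target only count; the packet then falls through
    # and returns to the FORWARD chain
    echo "iptables -A $chain -i $job_if -d $campus"
    echo "iptables -A $chain -i $job_if ! -d $campus"
    echo "iptables -A $chain -o $job_if -s $campus"
    echo "iptables -A $chain -o $job_if ! -s $campus"
}
```

With a chain named, say, JOB_ACCT, `iptables -L JOB_ACCT -n -v -x` shows just the job's counters and `iptables -Z JOB_ACCT` resets them without touching anything else. The ACCEPT rules from the recipe are still needed to let the traffic through.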

Thursday, September 8, 2011
