
Thursday, September 8, 2011

Per-Batch Job Network Statistics

Introduction

The OSG takes a fairly abstract definition of a cloud:

A cloud is a service that provisions resources on-demand for a marginal cost

The two important pieces of this definition are "resource provisioning" and "marginal cost".  The most common cloud you'll run into is Amazon EC2, which provisions VMs; depending on the size of the VM, the marginal cost is between $0.03 and $0.80 per hour.

The EC2 charge model is actually more complicated than just VMs-per-hour.  There are additional charges for storage and network use.  In controlled experiments last year, CMS determined that the largest cost of using EC2 was not the CPU time, but the network usage.

This exposed a glaring hole in OSG's current accounting: we only record wall and CPU time.  For the other metrics - which can't be estimated accurately by looking at wall time - we are blind.

Long story short - if OSG ever wants to provide a cloud service using our batch systems, we need better accounting.

Hence, we are running a technology investigation to bring batch system accounting up to par with EC2's: https://jira.opensciencegrid.org/browse/TECHNOLOGY-2

Our current target is to provide a proof-of-concept using Condor.  With Condor 7.7.0's cgroup integration, the CPU/memory usage is very accurate, but network accounting for vanilla jobs is missing.  Network accounting is the topic for this post; we have the following goals:

  • The accounting should be done for all processes spawned during the batch job.
  • All network traffic should be included.
  • Account for LAN traffic separately from WAN traffic (in EC2, these have different costs).
The Woes of Linux Network Accounting

The state of Linux network accounting, well, sucks (for our purposes!).  Here are a few ways to tackle it, and why each of them won't work:

  • Counting packets through an interface: If you assume that there is only one job per host, you can count the packets that go through a network interface.  This is a big, currently unlikely, assumption.
  • Per-process accounting: There exists a kernel patch floating around on the internet that adds per-process in/out statistics.  However, other than polling frequently, we have no mechanism to account for short-lived processes.  Besides, asking folks to run custom kernels is a good way to get ignored.
  • cgroups: There is a network controller (net_cls) in cgroups.  It marks packets in such a way that they can be manipulated by the tc utility; tc controls the layer of buffering before packets are transferred to the network card and can do accounting.  Unfortunately:
    • In RHEL6, there's no way to persist tc rules.
    • This only accounts for outgoing packets; incoming packets do not pass through.
    • We cannot distinguish between local network traffic and off-campus network traffic.  This can actually be overcome with a technique similar in difficulty to Berkeley Packet Filters (BPF), but it would be difficult.
  • ptrace or dynamic loader techniques: There exist libraries (exemplified by parrot) that provide a mechanism for intercepting calls.  We could instrument these.  However, this path is notoriously buggy and difficult to maintain: it would require a lot of code, and it would not work for statically-compiled processes.
The most full-featured network accounting is in the routing code controlled by iptables.  In particular, it can account for both incoming and outgoing traffic, and differentiate between on-campus and off-campus traffic.

We're going to tackle the problem using iptables; the trick is going to be distinguishing all the traffic from a single batch job.  As in the previous series on managing batch system processes, we are going to borrow heavily from techniques used in Linux containers.


Per-Batch Job Network Statistics

To get perfect per-batch-job network statistics that differentiate between local and remote traffic, we will combine iptables, NAT, virtual ethernet devices, and network namespaces.  It will be something of a tour de force of Linux kernel networking - and, for now, a very manual one.  Automation is still forthcoming.

This recipe is a synthesis of the ideas presented in the following pages:

We'll be thinking of the batch job as a "semi-container": it will get its own network device like a container, but have more visibility into the OS than a full container would.  To follow this recipe, you'll need RHEL6 or later.


First, we'll create a pair of ethernet devices and set up NAT-based routing between them and the rest of the OS.  We will assume eth0 is the outbound network device and that the IPs 192.168.0.1 and 192.168.0.2 are not currently routed on the network.

  1. Enable IP forwarding:
    echo 1 > /proc/sys/net/ipv4/ip_forward
  2. Create a veth ethernet device pair:
    ip link add type veth
    This will create two devices, veth0 and veth1, that act like a Unix pipe: bytes sent to veth1 will be received by veth0 (and vice versa).
  3. Assign IPs to the new veth devices; we will use 192.168.0.1 and 192.168.0.2:
    ifconfig veth0 192.168.0.1/24 up
    ifconfig veth1 192.168.0.2/24 up
  4. Download and compile ns_exec.c; this is a handy utility developed by IBM that allows us to create processes in new namespaces.  Compilation can be done like this:
    gcc -o ns_exec ns_exec.c
    This requires a RHEL6 kernel and the kernel headers.
  5. In a separate window, launch a new shell in a new network and mount namespace:
    ./ns_exec -nm -- /bin/bash
    We'll refer to this as shell 2 and our original window as shell 1.
  6. Use ps to determine the pid of shell 2.  In shell 1, execute:
    ip link set veth1 netns $PID_OF_SHELL_2
    In shell 2, you should be able to run ifconfig -a and see veth1 (the device arrives in a down state, so plain ifconfig may not show it).
  7. In shell 2, re-mount the /sys filesystem and enable the loopback device:
    mount -t sysfs none /sys
    ifconfig lo up
    Moving a device into a new namespace typically resets its configuration, so you will likely need to re-assign veth1's address in shell 2 and add a default route for off-subnet traffic:
    ifconfig veth1 192.168.0.2/24 up
    route add default gw 192.168.0.1
At this point, we have a "batch job" (shell 2) with its own dedicated networking device.  All traffic generated by this process - or its children - must pass through here.  Traffic generated in shell 2 will go into veth1 and out veth0.  However, we haven't hooked up the routing for veth0, so packets currently stop there; fairly useless.
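Before adding the routing, you can sanity-check the veth pipe itself; a quick test, assuming tcpdump is installed:

# In shell 2: send traffic toward the root namespace's end of the pipe.
ping -c 3 192.168.0.1
# In shell 1: the same packets should appear on veth0.
tcpdump -n -i veth0 icmp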

Next, we create a NAT between veth0 and eth0.  This is a point where the recipe could diverge - alternately, we could bridge the networks at layer 2 or layer 3 and provide the job with its own public IP.  I'll leave that as an exercise for the reader.  For the NAT, I will assume that 129.93.0.0/16 is the on-campus network and everything else is off-campus.  Everything will be done in shell 1:
  1. Verify that any firewall won't block the NAT packets.  If you don't know how to check, turn off the firewall with:
    iptables -F
    If you want a firewall but don't know how iptables works, you probably want to spend a few hours learning first.
  2. Enable the packet mangling for NAT:
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
  3. Forward packets from veth0 to eth0, using separate rules for on/off campus:
    iptables -A FORWARD -i veth0 -o eth0 --dst 129.93.0.0/16 -j ACCEPT
    iptables -A FORWARD -i veth0 -o eth0 ! --dst 129.93.0.0/16 -j ACCEPT
  4. Forward return traffic (established and related connections) from eth0 to veth0, again using separate on/off-campus rules:
    iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED --src 129.93.0.0/16 -j ACCEPT
    iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED ! --src 129.93.0.0/16 -j ACCEPT
At this point, you can switch back to shell 2 and verify the network is working.  iptables does accounting automatically; you just need the right command-line flags to print it:
iptables -L -n -v -x
If you look at the network accounting reference, it shows how to separate all the accounting rules into a dedicated chain.  This allows you to, for example, reset the counters for only the traffic accounting (a sketch follows the output below).  On my example host, the output looks like this:
Chain INPUT (policy ACCEPT 4 packets, 524 bytes)
    pkts      bytes target     prot opt in     out     source               destination        

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
    pkts      bytes target     prot opt in     out     source               destination        
      30     1570 ACCEPT     all  --  veth0  eth0    0.0.0.0/0            129.93.0.0/16      
      18     1025 ACCEPT     all  --  veth0  eth0    0.0.0.0/0           !129.93.0.0/16      
      28    26759 ACCEPT     all  --  eth0   veth0   129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED
      17    10573 ACCEPT     all  --  eth0   veth0  !129.93.0.0/16        0.0.0.0/0           state RELATED,ESTABLISHED

Chain OUTPUT (policy ACCEPT 4 packets, 276 bytes)
    pkts      bytes target     prot opt in     out     source               destination        

As you can see, my "job" has downloaded about 26KB from on-campus and 10KB from off-campus.
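The separate-chain idea might look something like this; this is a sketch only, and the chain name NETACCT is my own invention, not from the reference:

# A chain whose only job is counting; ACCEPT ends the traversal here.
iptables -N NETACCT
iptables -A NETACCT --dst 129.93.0.0/16 -j ACCEPT
iptables -A NETACCT ! --dst 129.93.0.0/16 -j ACCEPT
# Send the job's outbound traffic through the counting chain (the
# inbound eth0-to-veth0 rules could jump to a second chain the same way).
iptables -A FORWARD -i veth0 -o eth0 -j NETACCT
# Zero just the accounting counters, leaving other chains untouched:
iptables -Z NETACCT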

Voilà!  Network accounting appropriate for a batch system!

Monday, June 13, 2011

Part I: How your batch system watches your processes (and why it's so bad at it)

Series Preamble

Almost every cluster sysadmin has faced a case of "users gone wild"; for us, it's almost always due to users abusing the shared file system or user processes escaping the watchful eye of the batch system.  If I could prevent abuse of the shared file system while keeping it functional, I'd be a rich man.  I'm not a rich man, so I'm going to be talking about the latter issue.  This is a big topic, so I'm going to be splitting it up into a few posts:
  • Part I: How your batch system watches your processes (and why it's so bad at it).
  • Part II: Keeping a mindful eye on your users with ProcPolice.
  • Part III: Death of the fork-bomb: Ironclad process tracking in batch systems.
A few caveats up-front: I'm going to be talking about the platform I know (Linux-based OS's) and the batch systems we use (Condor, PBS, and a bit of SGE).  Apologies to the Windows/obscure-Unix-variant/LSF users out there.

So, onward and upward!

Strategy: Process Groups
Each process on the system belongs to a process group, and process groups are further grouped into sessions (as in, login sessions).  Most batch systems, when starting a job, will start it in a new session with a fresh process group.  Process groups are at their most useful when sending signals: the batch system can send a signal (such as SIGKILL, to terminate processes) to an entire process group, and the kernel does the process tracking, appropriately signaling every process in the group.

If this worked well, it would be a short blog series.  Unfortunately, any process can start a new process group - removing itself from its previous group.  To hide from the batch system, one can simply daemonize and start a new process group.  Voilà!  You've now escaped.
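A minimal sketch of both halves, assuming a bash-like shell and the util-linux setsid utility ($PGID is a placeholder for the job's process group ID):

# The batch system's side: signal the entire process group at once.
kill -TERM -- -$PGID

# The escape: setsid starts the command in a new session (and thus a
# new process group), so group-wide signals no longer reach it.
setsid sleep 3600 &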

Process groups are utilized by all batch systems; despite not working well, this is the most straightforward mechanism.

Strategy: Process Trees
Another strategy is "process trees".  When a process is created, it has both a process ID and a parent process ID.  By reading from the /proc filesystem (side note: Mac OS X has a far superior way of getting a programmatic snapshot of all system processes.  Jealous.), the batch system can create a list of all processes on the system and their parents, and build an in-memory tree of the parent-child relationships.  By starting with the process the batch system launched for the job and walking through the tree, the batch system can determine all the processes associated with a job.  Creating and updating a process tree is poll-based: the operation is performed every X seconds (a typical value might be X=5).  As long as every process lives at least X seconds, the batch system will see it.
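For illustration, a rough shell rendition of one polling pass (the descendants function and $JOB_PID are my own placeholder names):

# Walk the parent-child relationships from ps, starting at the PID the
# batch system launched, to find every process in the job's tree.
descendants() {
    local pid children
    for pid in "$@"; do
        echo "$pid"
        children=$(ps -o pid= --ppid "$pid")
        [ -n "$children" ] && descendants $children
    done
}
descendants "$JOB_PID"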

Well-behaved batch jobs stay within the process tree started by the batch system; this way, they get monitored by the batch system and their resource usage gets accounted.

The issue here, again, is badly behaved processes.  When a process's parent exits, the child is assigned a new parent: init, or PID 1.  To escape the notice of a batch system that uses process trees, one needs to fork twice (creating a child and "grandchild"), then have the immediate child die.  The grandchild is re-parented to init, as if it were a system daemon.  Done quickly enough, the grandchild has escaped the batch system.
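In shell, the escape fits on one line (sleep stands in for the escaping payload):

# The subshell (the child) backgrounds the payload (the grandchild)
# and exits immediately; the payload is re-parented to init before
# the next poll.
( sleep 3600 & )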

The process tree strategy is used by Condor.

Strategy: Environment Cookies
A process's Unix environment is automatically inherited by its children, and remains unchanged if the parent exits.  Condor currently takes advantage of these facts and inserts an extra environment variable into each batch system job.  If you dump the environment of your current job using "env", you might see something like this:

_CONDOR_ANCESTOR_17948=17952:1307975354:2631244213
_CONDOR_ANCESTOR_17952=18260:1307976308:2791283533
_CONDOR_ANCESTOR_18260=18263:1307976309:1204008886

Each of these are environment variables used by Condor to track the process's ancestry.  In this case, the condor_starter's PID is 18260 and the job's PID is 18263 (the other entries are from parents of the condor_starter process, the condor_startd and condor_master).  Any sub-process started by the job will retain the _CONDOR_ANCESTOR_18260 variable by default.

When Condor polls the /proc filesystem to build a process tree, it can also read out these environment variables and use the information to build the tree.  As before, this relies on the user being friendly: if the environment variables are changed, the process can once again escape the batch system.
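Both halves are easy to sketch in shell ($PID and my_daemon are placeholders):

# What the batch system reads back (root is generally required to read
# another user's /proc/$PID/environ):
tr '\0' '\n' < /proc/$PID/environ | grep _CONDOR_ANCESTOR
# What an unfriendly job can do: wipe the inherited environment, so
# children never carry the cookie.
env -i ./my_daemon &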

Strategy: Supplementary Group IDs
Notice that all strategies so far involve some property of the process which is automatically inherited by its children (the process group, the process ancestry, or the Unix environment variables), but can be changed by the user's job.

A property inherited by subprocesses that cannot be changed without special privilege is the set of group IDs.  Each process has a set of group IDs associated with it (if you look at the contents of /proc/self/status, you can see the groups associated with your terminal); it requires administrator privileges - which the batch system has but the user does not - to add or remove group IDs.

Condor and SGE can be assigned a range of group IDs to hand out, and assign one of the IDs to the job process they launch.  Assuming there is only one instance of the batch system on the node, any process with that group ID must have come from the batch job.  So, when it comes time to kill batch jobs or perform accounting, we can map any process back to the batch system job.
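You can see these supplementary groups directly from inside a job; a quick check:

# The Groups: line lists supplementary GIDs; under GID tracking, one
# of them is the batch system's reserved tracking GID.
grep ^Groups: /proc/self/status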

While the user process cannot get rid of the ID, this setup is still possible to defeat (discussed below), and it has a few drawbacks.  The user process now has a new GID, and can create files using that GID; I have no clue how this might be useful, but it's a sign of misusing the GID concept.  Anything that caches the user-to-groups mapping may get the wrong set of GIDs (as unique per-process GIDs are rare, these caches may have broken assumptions).  Finally, it lays extra work on the sysadmin, who now must maintain a range of unused GIDs; the range must be sufficient to provide a GID per batch slot.  Locally, we've run into the fact that the number of GIDs needed increases with the number of cores per node: what was a good setting last year is no longer sufficient.

Note that, with Condor, you can take this one step further and assign a unique user ID per batch slot, and run the job under that UID as opposed to the submitter's UID.  This is a nightmare in terms of NFS-based shared file systems, but the approach at least works on both Unix and Windows.

How to defeat your batch system (inadvertently, right?)
Despite the drawbacks, the supplementary GID mechanism seems pretty foolproof: the user can no longer launch processes that can't be tracked back to a batch slot.  However, this isn't sufficient to stop malicious users.

In order to kill all processes based on some attribute of the process (besides the process group), one must iterate through the contents of the /proc directory, read and parse each process's status file, and send a kill signal as appropriate.  Ultimately, all batch systems currently do some variation of this; if you want a simple source code example, go look up the sources of the venerable 'killall' utility.
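A rough shell rendition of that sweep, keyed on a supplementary tracking GID as above (the GID 9620 is a placeholder for one from the reserved range):

# Sweep /proc, parse each status file, and signal matching processes.
for s in /proc/[0-9]*/status; do
    if grep -q '^Groups:.*\<9620\>' "$s" 2>/dev/null; then
        pid=${s#/proc/}; pid=${pid%/status}
        kill -KILL "$pid"
    fi
done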

The approach described above does have a fatal flaw: it is not atomic.  Between looking at the contents of /proc, and opening /proc/PID/status, a process could have already forked another child and exited.  Processes may have been spawned between the time when the directory iteration begins and ends, meaning they might never be seen.

Hence, a process may spawn more children in the time it takes the batch system to iterate through /proc and kill it; in fact, if the batch system is unlucky, the process may do this fast enough that the batch system never detects it exists in the first place!  In the latter case, regardless of the tracking mechanism, the process escapes the batch system.

Worse, because these short-lived processes can be invisible to the batch system, it may never detect that it's being fooled; if the batch system could reliably detect the attack, it might be able to send an alert or turn off the worker node.

Ultimately, the batch system is defeated because it is trying to do process control from user space.  We lack three things:
  1. A way to reliably track processes without changing the semantics of the job's runtime environment.
  2. Atomic operations for determining and signaling a set of processes.
  3. A way to detect when (1) or (2) has failed.
Luckily, with a little help from the Linux kernel, we can overcome all three of the above issues.  Item (2) takes a fairly modern kernel (2.6.24 or later), but items (1) and (3) can be accomplished with 2.6.0 or later.

As long as we have the ability to detect attacks as in (3), we can limp along until everyone gets onto a modern kernel: this is the topic of the next post.  Stay tuned.