First, we start off with the lowly condor_starter on any worker node with a network connection (to simplify things, I didn't draw the other condor processes involved):
By default, all processes on the node are in the same network namespace (labelled the "System Network Namespace" in this diagram). We denote the network interface with a box, and assume it has address 192.168.0.1.
Next, the starter will create a pair of virtual ethernet devices. We will refer to them as pipe devices, because any byte written into one will come out of the other, just like a venerable Unix pipe:
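To make the veth pair concrete, here is a minimal Python sketch that builds the equivalent iproute2 command. This is an illustration only: the patch itself talks to the kernel over netlink from C, and the device names `veth0` and `veth1` are assumptions (only `veth0` appears later in this post).

```python
import subprocess


def veth_pair_command(external: str, internal: str) -> list[str]:
    """Build the iproute2 command that creates a connected veth pair."""
    return ["ip", "link", "add", external, "type", "veth", "peer", "name", internal]


def create_veth_pair(external: str = "veth0", internal: str = "veth1") -> None:
    """Create the pair. Needs root (CAP_NET_ADMIN); raises on failure."""
    # Device names are hypothetical; both ends start out down with no address.
    subprocess.run(veth_pair_command(external, internal), check=True)
```

Calling `create_veth_pair()` as root should leave both devices visible in `ip link show`, still in the down state.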
By default, the network pipes are in a down state and have no IP address associated with them. Not very useful! At this point, we have some decisions to make: how should the network pipe device be presented to the network? Should it be networked at layer 3, using NAT to route packets? Or should we bridge it at layer 2, allowing the device to have a public IP address?
Really, it's up to the site, but we assume most sites will want to take the NAT approach: a public IP address might seem useful, but bridging would require a public IP for each job. To allow customization, all the routing is done by a helper script, and we provide a default implementation for NAT. The script:
- Takes two arguments, a unique "job identifier" and the name of the network pipe device.
- Is responsible for setting up any routing required for the device.
- Must create an iptables chain with the same name as the "job identifier".
- Each rule in the chain will record the number of bytes matched; at the end of the job, these will be reported in the job ClassAd using an attribute named after the comment on the rule (with a Network prefix).
- On stdout, returns the IP address the internal network pipe should use.
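The contract above can be sketched in Python as follows. This is not the default helper shipped with the patch; it is a rough illustration under several assumptions: the uplink is named `em1` (as in the chain listing later in this post), the job address `192.168.1.2` is a hypothetical choice, and the rule comments `Outgoing` and `Incoming` are made up. It also omits steps a real helper would need, such as addressing the external device and enabling IP forwarding.

```python
import subprocess


def nat_helper_commands(job_id: str, device: str, job_ip: str,
                        uplink: str = "em1") -> list[list[str]]:
    """Build iptables commands for a per-job accounting chain plus NAT."""
    chain = job_id  # the chain must carry the job identifier as its name
    return [
        ["iptables", "-N", chain],
        # Send forwarded traffic to and from the job's device through the chain.
        ["iptables", "-A", "FORWARD", "-i", device, "-j", chain],
        ["iptables", "-A", "FORWARD", "-o", device, "-j", chain],
        # Each commented rule keeps its own byte counter.
        ["iptables", "-A", chain, "-i", device, "-o", uplink,
         "-m", "comment", "--comment", "Outgoing", "-j", "ACCEPT"],
        ["iptables", "-A", chain, "-i", uplink, "-o", device,
         "-m", "state", "--state", "RELATED,ESTABLISHED",
         "-m", "comment", "--comment", "Incoming", "-j", "ACCEPT"],
        ["iptables", "-A", chain, "-j", "REJECT"],
        # NAT the job's private address behind the physical interface.
        ["iptables", "-t", "nat", "-A", "POSTROUTING", "-s", job_ip,
         "-o", uplink, "-j", "MASQUERADE"],
    ]


def run_helper(job_id: str, device: str, job_ip: str = "192.168.1.2") -> None:
    """Apply the rules (needs root), then report the job's IP on stdout."""
    for command in nat_helper_commands(job_id, device, job_ip):
        subprocess.run(command, check=True)
    print(job_ip)
```

A site that wanted bridging or per-user security zones would replace only this script, which is the point of keeping the routing out of the starter.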
Next, the starter forks a separate process in a new network namespace using the clone() call with the CLONE_NEWNET flag. Notice that, by default, no network devices are accessible in the new namespace:
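For readers who want to poke at this, here is a small Python sketch of the same idea. Note the substitution: instead of clone() with CLONE_NEWNET, it forks and then calls unshare(CLONE_NEWNET) in the child through ctypes, which has a similar effect for this demonstration. It assumes Linux with glibc and needs root (or CAP_SYS_ADMIN); the function name is hypothetical.

```python
import ctypes
import os
import socket

CLONE_NEWNET = 0x40000000  # value from <linux/sched.h>


def interfaces_in_new_namespace() -> list[str]:
    """Fork a child, move it into a fresh network namespace, list its devices."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: detach from the system network namespace
        os.close(read_fd)
        libc = ctypes.CDLL(None, use_errno=True)
        if libc.unshare(CLONE_NEWNET) != 0:
            message = "error: " + os.strerror(ctypes.get_errno())
        else:
            message = ",".join(name for _, name in socket.if_nameindex())
        os.write(write_fd, message.encode())
        os._exit(0)
    # parent: read the child's report through the pipe, then reap the child
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        output = reader.read()
    os.waitpid(pid, 0)
    if output.startswith("error: "):
        raise OSError(output)
    return output.split(",") if output else []
```

When run with sufficient privileges, I would expect the child to see at most a loopback device, and none of the host's real interfaces.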
Next, the external starter will pass one side of the pipe to the other namespace; the internal starter will do some minimal configuration of the device (default route, IP address, set the device to the "up" status):
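The hand-off and configuration can be approximated with iproute2 commands, sketched below in Python. Again, the patch does this over netlink in C; the device name `veth1`, the addresses `192.168.1.2` and `192.168.1.1`, and the `/24` prefix are all assumptions for illustration.

```python
def external_commands(internal_dev: str, child_pid: int) -> list[list[str]]:
    """Run by the external starter: move one end into the child's namespace."""
    return [["ip", "link", "set", internal_dev, "netns", str(child_pid)]]


def internal_commands(internal_dev: str, job_ip: str, gateway_ip: str,
                      prefix_len: int = 24) -> list[list[str]]:
    """Run by the internal starter, inside the new namespace."""
    return [
        # Assign the address handed back by the helper script.
        ["ip", "addr", "add", f"{job_ip}/{prefix_len}", "dev", internal_dev],
        # Bring the device up.
        ["ip", "link", "set", internal_dev, "up"],
        # Route everything through the external end of the pipe.
        ["ip", "route", "add", "default", "via", gateway_ip],
    ]
```

The two halves have to be ordered: the internal starter can only configure the device after the external starter has moved it, which is why the starters need to synchronize with each other.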
Finally, the starter exec's to the job. Whenever the job does any network operations, the bytes are routed via the internal network pipe, come out the external network pipe, and then are NAT'd to the physical network device before exiting the machine.
As mentioned, the whole point of the exercise is to do network accounting. Since all packets go through one device, Condor can read out all the activity via iptables. The "helper script" above will create a unique chain per job. This allows some level of flexibility; for example, the chain below allows us to distinguish between on-campus and off-campus packets:
Chain JOB_12345 (2 references)
pkts bytes target prot opt in    out   source             destination
   0     0 ACCEPT all  --  veth0 em1   anywhere           22.214.171.124/16   /* OutgoingInternal */
   0     0 ACCEPT all  --  veth0 em1   anywhere           !126.96.36.199/16   /* OutgoingExternal */
   0     0 ACCEPT all  --  em1   veth0 188.8.131.52/16    anywhere            state RELATED,ESTABLISHED /* IncomingInternal */
   0     0 ACCEPT all  --  em1   veth0 !184.108.40.206/16 anywhere            state RELATED,ESTABLISHED /* IncomingExternal */
   0     0 REJECT all  --  any   any   anywhere           anywhere            reject-with icmp-port-unreachable
Thus, the resulting ClassAd history from this job will have attributes named NetworkOutgoingInternal, NetworkOutgoingExternal, NetworkIncomingInternal, and NetworkIncomingExternal. We have an updated Condor Gratia probe that looks for Network* attributes and reports them appropriately to the accounting database.
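To show how rule comments could turn into attributes, here is a minimal Python sketch that parses the text of a verbose chain listing. This is a substitute technique for illustration: the patch reads the iptables configuration directly from C rather than parsing text. The sketch assumes output from `iptables -L <chain> -v -n -x`, so that byte counts are exact integers, and the function name is hypothetical.

```python
import re

COMMENT_PATTERN = re.compile(r"/\*\s*(\S+)\s*\*/")


def parse_chain_counters(listing: str) -> dict[str, int]:
    """Map each commented rule to a Network<Comment> byte total."""
    counters: dict[str, int] = {}
    for line in listing.splitlines():
        match = COMMENT_PATTERN.search(line)
        if match is None:
            continue  # chain header, column header, or an uncommented rule
        fields = line.split()
        byte_count = int(fields[1])  # columns start with: pkts, bytes, target
        attribute = "Network" + match.group(1)
        # Sum the counters if several rules share one comment.
        counters[attribute] = counters.get(attribute, 0) + byte_count
    return counters
```

Rules without a comment, such as the final REJECT, are skipped and so never show up in the ClassAd.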
Thus, we have byte-level network accounting, allowing us to answer the age-old question of "how much would a CMS T2 cost on Amazon EC2?". Or perhaps we could answer "how much is a currently running job going to cost me?" Matt has pointed out the network setup callout could be used to implement security zones, isolating (or QoS'ing) jobs of certain users at the network level. There are quite a few possibilities!
We'll definitely be returning to this work mid-2012 when the local T2 is based on SL6, and this patch can be put into production. There will be some further engagement with the Condor team to see if they're interested in taking the patch. The Gratia probe work to manage network information should be of interest upstream too. Finally, I encourage interested readers to take a look at the github branch. The patch itself is a tour-de-force of several dark corners of Linux systems programming (it involves using clone, synchronizing processes with pipes, sending messages to the kernel via netlink to configure the routing, and reading out iptables configurations using C). It was very rewarding to implement!