<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8803173202887660937</id><updated>2012-05-16T11:07:44.449-07:00</updated><category term='linux'/><category term='virtualization'/><category term='glexec'/><category term='batch system'/><category term='hold'/><category term='technology'/><category term='openstack'/><category term='Condor'/><category term='isolation'/><category term='security'/><category term='kernel'/><category term='cgroups'/><category term='gums'/><category term='hcc'/><category term='Introductions'/><category term='Investigations'/><category term='chroot'/><category term='Blueprint'/><category term='networking'/><category term='accounting'/><category term='system administration'/><title type='text'>OSG Technology Area Rumblings</title><subtitle type='html'>Updates on the activities of the OSG Technology Area, and life in Distributed High Throughput Computing in general.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' 
src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>25</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-2804145841486035015</id><published>2012-03-10T11:28:00.000-08:00</published><updated>2012-03-10T11:28:31.896-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='isolation'/><category scheme='http://www.blogger.com/atom/ns#' term='Condor'/><category scheme='http://www.blogger.com/atom/ns#' term='cgroups'/><title type='text'>Resource Isolation in Condor using cgroups</title><content type='html'>This is the last in my series on job isolation techniques. &amp;nbsp;It has spanned several postings over the last month, so it may help to recap:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://osgtech.blogspot.com/2012/02/job-isolation-in-condor.html"&gt;Part I&lt;/a&gt; covered process isolation: preventing processes in one job from interacting with other jobs. &amp;nbsp;This has been achievable through POSIX mechanisms for a while, but the new PID namespace mechanism provides improved isolation for jobs running as the same user.&lt;/li&gt;&lt;li&gt;&lt;a href="http://osgtech.blogspot.com/2012/02/file-isolation-using-bind-mounts-and.html"&gt;Part II&lt;/a&gt;&amp;nbsp;and &lt;a href="http://osgtech.blogspot.com/2012/02/improving-file-isolation-with-chroot.html"&gt;Part III&lt;/a&gt; discussed file isolation using bind mounts and chroots. &amp;nbsp;Condor uses bind mounts to remove access to "problematic" directories such as /tmp. 
&amp;nbsp;While more complex to set up, chroots allow jobs to run in an environment completely separate from the host and further isolate the job sandbox.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;This post will cover &lt;i&gt;resource isolation&lt;/i&gt;:&amp;nbsp;preventing jobs from consuming system resources promised to another job.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Condor has always had some crude form of resource isolation. &amp;nbsp;For example, the worker node could be configured to detect when the processes in a job have accumulated more CPU time than walltime (a rough indication that more than one core is being used) or when the sum of each process's virtual memory size exceeds the memory requested for the job. &amp;nbsp;When Condor detects too many resources are being consumed, it can take an action such as suspending or killing the job.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This traditional approach is relatively unsatisfactory for a few reasons:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Condor periodically polls to view resource consumption. &amp;nbsp;Any activity between polls is unmonitored.&lt;/li&gt;&lt;li&gt;The metrics Condor traditionally monitors are limited to memory and CPU, and the memory metrics are of poor quality for complex jobs. &amp;nbsp;The sum of many processes' virtual memory sizes, on a modern Linux box, has little correlation with RAM used and is not particularly meaningful.&lt;/li&gt;&lt;li&gt;We can do little with the system besides detect when resource limits have been violated and kill the job.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;We cannot, for example, simply instruct the kernel to reduce the job's memory or CPU usage.&lt;/li&gt;&lt;li&gt;Accordingly, users must ask for &lt;b&gt;peak&lt;/b&gt;&amp;nbsp;resource usage, which may be well above &lt;b&gt;average&lt;/b&gt;&amp;nbsp;resource usage, &lt;i&gt;decreasing overall throughput&lt;/i&gt;. 
&amp;nbsp;If the job needs 2GB on average but 4GB for a single second, the user will ask for 4GB; the other 2GB will go unused the rest of the time.&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;In Linux, the oldest form of resource isolation is processor affinity or CPU pinning: a job can be locked to a specific CPU, and all its processes will inherit the affinity. &amp;nbsp;Because two jobs are locked to separate CPUs, they will never consume each other's CPU resources. &amp;nbsp;CPU pinning is unsatisfactory for reasons similar to those for memory: jobs can't utilize otherwise-idle CPUs, decreasing potential system throughput. &amp;nbsp;The granularity is also poor: you can't evenly fairshare 25 jobs on a machine with 24 cores as each job must be locked to at least one core. &amp;nbsp;However, it's a step forward - you don't need to kill jobs for using too much CPU - and it has been present in Condor since 7.3.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Newer Linux kernels support &lt;a href="http://osgtech.blogspot.com/2011/07/part-iii-bulletproof-process-tracking.html"&gt;cgroups&lt;/a&gt;, which are structures for managing groups of processes, and provide &lt;i&gt;controllers&lt;/i&gt;&amp;nbsp;for managing resources in each cgroup. &amp;nbsp;In Condor 7.7.0, cgroup support was added for measuring resource usage. &amp;nbsp;When enabled, Condor will place each job into a dedicated cgroup for the block-I/O, memory, CPU, and "freezer" controllers. &amp;nbsp;&lt;a href="https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2734"&gt;We have implemented&lt;/a&gt;&amp;nbsp;two new limiting mechanisms based on the memory and CPU controllers.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The CPU controller provides a mechanism for fairsharing between different cgroups. &amp;nbsp;CPU shares are assigned to jobs based on the "slot weight" (by default, equal to the number of cores the job requested). 
&amp;nbsp;Thus, a job asking for 2 cores will get an average of 2 cores on a fully loaded system. &amp;nbsp;If there's an idle CPU, it could utilize more than 2 cores; however, it will never get less than what it requested for a significant amount of time. &amp;nbsp;CPU fairsharing provides a much finer granularity than pinning, easily allowing the jobs-to-cores ratio to be non-integer.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The memory controller provides two kinds of limits: soft and hard. &amp;nbsp;When &lt;i&gt;soft&lt;/i&gt; limits are in place, the job can use an arbitrary amount of RAM until the host runs out of memory (and starts to swap); when this happens, only jobs over their limit are swapped out. &amp;nbsp;With &lt;i&gt;hard&lt;/i&gt;&amp;nbsp;limits, the job immediately starts swapping once it hits its RAM limit, regardless of the amount of free memory. &amp;nbsp;Both soft and hard limits default to the amount of memory requested for the job.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Both methods also have disadvantages. &amp;nbsp;Soft limits can cause "well-behaved" processes to wait while the OS frees up RAM from "badly behaving" processes. &amp;nbsp;Hard limits can cause large amounts of swapping (for example, if there's a memory leak), decreasing the entire node's disk performance and thus adversely affecting other jobs. &amp;nbsp;In fact, it may be a better use of resources to preempt a heavily-swapping process and reschedule it on another node than to let it continue running. &amp;nbsp;There is room for further improvement here.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Regardless, cgroups and controllers provide a solid improvement in resource isolation for Condor, and finish up our series on job isolation. 
&amp;nbsp;Thanks for reading!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-2804145841486035015?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/2804145841486035015/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2012/03/resource-isolation-in-condor-using.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/2804145841486035015'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/2804145841486035015'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2012/03/resource-isolation-in-condor-using.html' title='Resource Isolation in Condor using cgroups'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-4833728643726960817</id><published>2012-02-27T11:33:00.000-08:00</published><updated>2012-02-27T11:33:56.265-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='isolation'/><category scheme='http://www.blogger.com/atom/ns#' term='Condor'/><category scheme='http://www.blogger.com/atom/ns#' term='security'/><category scheme='http://www.blogger.com/atom/ns#' term='chroot'/><title type='text'>Improving File Isolation with chroot</title><content type='html'>&lt;a 
href="http://osgtech.blogspot.com/2012/02/file-isolation-using-bind-mounts-and.html"&gt;In the last post&lt;/a&gt;, we examined a new Condor feature called &lt;i&gt;MOUNT_UNDER_SCRATCH&lt;/i&gt; that will isolate jobs from each other on the file system by making world-writable directories (such as &lt;i&gt;/tmp&lt;/i&gt; and &lt;i&gt;/var/tmp&lt;/i&gt;) be unique and isolated per-batch-job.&lt;br /&gt;&lt;br /&gt;That work started with the assumption that jobs from the same Unix user don't need to be isolated from each other. &amp;nbsp;This isn't necessarily true on the grid: a single, shared account per-VO is still popular on the OSG. &amp;nbsp;For such VOs, an attacker can gain additional credentials by reading the sandbox of each job running under the same Unix username.&lt;br /&gt;&lt;br /&gt;To combat proxy-stealing, we use an old Linux trick called a "&lt;b&gt;chroot&lt;/b&gt;". &amp;nbsp;A sysadmin can create a complete copy of the OS inside a directory, and an appropriately-privileged process can change the root of its filesystem ("&lt;i&gt;/&lt;/i&gt;") to that directory. 
&amp;nbsp;In fact, the phrase "changing root" is where we get the "&lt;b&gt;chroot&lt;/b&gt;" terminology.&lt;br /&gt;&lt;br /&gt;For example, suppose the root of the system looks like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[root@localhost ~]# ls /&lt;br /&gt;bin &amp;nbsp; &amp;nbsp; cvmfs &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; hadoop-data2 &amp;nbsp;home &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;media &amp;nbsp;opt &amp;nbsp; selinux &amp;nbsp;usr&lt;br /&gt;boot &amp;nbsp; &amp;nbsp;dev &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; hadoop-data3 &amp;nbsp;lib &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; misc &amp;nbsp; proc &amp;nbsp;srv &amp;nbsp; &amp;nbsp; &amp;nbsp;var&lt;br /&gt;cgroup &amp;nbsp;etc &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; hadoop-data4 &amp;nbsp;lib64 &amp;nbsp; &amp;nbsp; &amp;nbsp; mnt &amp;nbsp; &amp;nbsp;root &amp;nbsp;sys&lt;br /&gt;chroot &amp;nbsp;hadoop-data1 &amp;nbsp;hadoop.log &amp;nbsp; &amp;nbsp;lost+found &amp;nbsp;net &amp;nbsp; &amp;nbsp;sbin &amp;nbsp;tmp&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The sysadmin can create a copy of the RHEL5 operating system inside a sub-directory; at our site, this is &lt;i&gt;/chroot/sl5-v3/root&lt;/i&gt;:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;[root@localhost ~]# ls /chroot/sl5-v3/root/&lt;br /&gt;bin &amp;nbsp; cvmfs &amp;nbsp;etc &amp;nbsp; lib &amp;nbsp; &amp;nbsp;media &amp;nbsp;opt &amp;nbsp; root &amp;nbsp;selinux &amp;nbsp;sys &amp;nbsp;usr&lt;br /&gt;boot &amp;nbsp;dev &amp;nbsp; &amp;nbsp;home &amp;nbsp;lib64 &amp;nbsp;mnt &amp;nbsp; &amp;nbsp;proc &amp;nbsp;sbin &amp;nbsp;srv &amp;nbsp; &amp;nbsp; &amp;nbsp;tmp &amp;nbsp;var&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note how the contents of the chroot directory are stripped down relative to the OS - we can remove dangerous binaries, sensitive configurations, or anything else unnecessary to running a job. 
&amp;nbsp;For example, many common Linux privilege escalation exploits come from the presence of a &lt;a href="http://en.wikipedia.org/wiki/Setuid"&gt;setuid binary&lt;/a&gt;. &amp;nbsp;Such binaries (&lt;b&gt;at&lt;/b&gt;, &lt;b&gt;cron&lt;/b&gt;, &lt;b&gt;ping&lt;/b&gt;) are necessary for managing the host, but not necessary for a running job. &amp;nbsp;By eliminating the setuid binaries from the chroot, a sysadmin can eliminate a common attack vector for processes running inside.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Once the directory is built, we can call chroot and isolate ourselves from the host:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;[root@red-d15n6 ~]# &lt;b&gt;chroot&lt;/b&gt; /chroot/sl5-v3/root/&lt;br /&gt;bash-3.2# ls /&lt;br /&gt;bin &amp;nbsp; cvmfs &amp;nbsp;etc &amp;nbsp; lib&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt; &amp;nbsp;media &amp;nbsp;opt &amp;nbsp; root &amp;nbsp;selinux &amp;nbsp;sys &amp;nbsp;usr&lt;br /&gt;boot &amp;nbsp;dev &amp;nbsp; &amp;nbsp;home &amp;nbsp;lib64 &amp;nbsp;mnt&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt; proc &amp;nbsp;sbin &amp;nbsp;srv &amp;nbsp; &amp;nbsp; &amp;nbsp;tmp &amp;nbsp;var&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Condor, as of 7.7.5, now knows how to invoke the &lt;b&gt;chroot&lt;/b&gt;&amp;nbsp;syscall for user jobs. &amp;nbsp;However, as the job sandbox is written &lt;i&gt;outside&lt;/i&gt;&amp;nbsp;the chroot, we must somehow transport it inside before starting the job. &amp;nbsp;Bind mounts - &lt;a href="http://osgtech.blogspot.com/2012/02/file-isolation-using-bind-mounts-and.html"&gt;discussed last time&lt;/a&gt; - come to our rescue. 
&amp;nbsp;The entire process goes something like this:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Condor, as root, forks off a new child process.&lt;/li&gt;&lt;li&gt;The child uses the &lt;b&gt;unshare&lt;/b&gt; system call to place itself in a new filesystem namespace.&lt;/li&gt;&lt;li&gt;The child calls &lt;b&gt;mount&lt;/b&gt; to bind-mount the job sandbox inside the chroot. &amp;nbsp;Any other bind mounts - such as &lt;i&gt;/tmp&lt;/i&gt; or &lt;i&gt;/var/tmp&lt;/i&gt; - are done at this time.&lt;/li&gt;&lt;li&gt;The child will invoke the &lt;b&gt;chroot&lt;/b&gt; system call specifying the directory the sysadmin has configured.&lt;/li&gt;&lt;li&gt;The child drops privileges to the target batch system user, then calls &lt;b&gt;exec&lt;/b&gt;&amp;nbsp;to start&amp;nbsp;the user process.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;&lt;div&gt;&lt;a href="https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2822"&gt;With this patch applied&lt;/a&gt;, Condor will copy &lt;i&gt;only&lt;/i&gt; the job's sandbox forward into the filesystem namespace, meaning the job has access to no other sandbox (as all other sandboxes live outside the private namespace). 
&amp;nbsp;This successfully isolates jobs from each other's sandboxes, even if they run under the same Unix user!&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;br /&gt;The Condor feature is referred to as&amp;nbsp;&lt;a href="https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2698"&gt;&lt;i&gt;NAMED_CHROOT&lt;/i&gt;&lt;/a&gt;, as sysadmins can create multiple chroot-capable directories, give each a user-friendly name (such as &lt;i&gt;RHEL5&lt;/i&gt;, as opposed to&amp;nbsp;&lt;i&gt;/chroot/sl5-v3/root&lt;/i&gt;), and allow user jobs to ask for the directory by the friendly name in their submit file.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In addition to the security benefits, we have found the &lt;i&gt;NAMED_CHROOT&lt;/i&gt; feature allows us to run a RHEL5 job on a RHEL6 host without using virtualization; something for the future.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Going back to our original list of directories needing isolation - system temporary directories, job sandbox, shared filesystems, and GRAM directories - we have now isolated everything except the shared filesystems. The option here is simple, if unpleasant: mount the shared file system as read-only. &amp;nbsp;This is the modus operandi for &lt;b&gt;$OSG_APP&lt;/b&gt;&amp;nbsp;at many sites, and an acceptable (but not recommended) way to run &lt;b&gt;$OSG_DATA&lt;/b&gt;&amp;nbsp;(as &lt;b&gt;$OSG_DATA&lt;/b&gt; is optional anyway). &amp;nbsp;It restricts the functionality for the user, but brings us a step closer to our goal of job isolation.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;After file isolation, we have one thing left: resource isolation. 
&amp;nbsp;Again, a topic for the future.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-4833728643726960817?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/4833728643726960817/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2012/02/improving-file-isolation-with-chroot.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/4833728643726960817'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/4833728643726960817'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2012/02/improving-file-isolation-with-chroot.html' title='Improving File Isolation with chroot'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-5734694593802146494</id><published>2012-02-20T08:03:00.000-08:00</published><updated>2012-02-20T08:03:49.861-08:00</updated><title type='text'>File Isolation using bind mounts and chroots</title><content type='html'>The last post ended with a new technique for process-level isolation that unlocks our ability to safely use anonymous accounts and group accounts.&lt;br /&gt;&lt;br /&gt;However, that's not "safe enough" for us: the jobs can still interact with each other via the file system. 
&amp;nbsp;This post examines the directories where jobs can write into, and what can be done to remove this access.&lt;br /&gt;&lt;br /&gt;On a typical batch system node, a user can write into the following directories:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;System temporary directories&lt;/b&gt;: The Linux Filesystem Hierarchy Standard (FHS) provides at least two &lt;a href="http://en.wikipedia.org/wiki/Sticky_bit"&gt;sticky&lt;/a&gt;, world-writable directories, /tmp and /var/tmp. &amp;nbsp;These directories are traditionally unmanaged (user processes can write an uncontrolled amount of data here) and a security issue (symlink attacks and information leaks), even when user separation is in place.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Job Sandbox&lt;/b&gt;: This is a directory created by the batch system as a scratch location for the job. &amp;nbsp;The contents of the directory will be cleaned out by the batch system after the job ends. &amp;nbsp;For Condor, any user proxy, executable, or job stage-in files will be copied here prior to the job starting.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Shared Filesystems&lt;/b&gt;: For a non-grid site, this is typically at least $HOME, and some other site-specific directory. &amp;nbsp;$HOME is owned by the user running the job. &amp;nbsp;On the OSG, we also have $OSG_APP for application installation (typically read-only for worker nodes) and, optionally, $OSG_DATA for data staging (writable for worker nodes). &amp;nbsp;If they exist and are writable, $OSG_APP/DATA are owned by root and marked as sticky.&lt;/li&gt;&lt;li&gt;&lt;b&gt;GRAM directories&lt;/b&gt;: For non-Condor OSG sites, a few user-writable directories are needed to transfer the executable, proxy, and job stage-in files from the gatekeeper to the worker node. &amp;nbsp;These default to $HOME, but can be relocated to any shared filesystem directory. 
&amp;nbsp;For Condor-based OSG sites, this is a part of the job sandbox.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;If user separation is in place and considered sufficient, filesystem isolation is taken care of for shared filesystems, GRAM directories, and the job sandbox. &amp;nbsp;The systemwide temporary directories can be protected by mixing &lt;i&gt;&lt;a href="http://www.kernel.org/doc/man-pages/online/pages/man2/clone.2.html"&gt;filesystem namespaces&lt;/a&gt;&lt;/i&gt; and &lt;i&gt;&lt;a href="http://www.kernel.org/doc/man-pages/online/pages/man2/mount.2.html"&gt;bind mounts&lt;/a&gt;&lt;/i&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A &lt;a href="http://www.kernel.org/doc/man-pages/online/pages/man2/clone.2.html"&gt;process can be launched&lt;/a&gt; in its own filesystem namespace; such a process will have a copy of the system mount table. &amp;nbsp;Any change made to the process's mount table will not be seen by the outside system, but will be shared with any child processes.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For example, if the user's home directory is not mounted on the host, the batch system could create a process in a new filesystem namespace and mount the home directory in that namespace. &amp;nbsp;The home directory will be available to the batch job, but to no other process on the host.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When the last process in the filesystem namespace exits, all mounts that are unique to that namespace will be unmounted. &amp;nbsp;In our example, when the batch job exits, the kernel will unmount the home directory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A &lt;a href="http://www.kernel.org/doc/man-pages/online/pages/man2/mount.2.html"&gt;bind mount&lt;/a&gt; makes a file or directory visible at another place in the filesystem - I think of it as mirroring the directory elsewhere. 
&amp;nbsp;We can take the job sandbox directory, create a sub-directory, and bind-mount the sub-directory over &lt;b&gt;/tmp&lt;/b&gt;. &amp;nbsp;The process is mostly equivalent to the following shell commands (where &lt;b&gt;$_CONDOR_SCRATCH_DIR&lt;/b&gt; is the location of the Condor job sandbox) in a filesystem namespace:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;mkdir $_CONDOR_SCRATCH_DIR/tmp&lt;br /&gt;mount --bind $_CONDOR_SCRATCH_DIR/tmp /tmp&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Afterward, any files a process creates in &lt;b&gt;/tmp&lt;/b&gt; will actually be stored in &lt;b&gt;$_CONDOR_SCRATCH_DIR/tmp&lt;/b&gt; - and cleaned up accordingly by Condor on job exit. &amp;nbsp;Any system process not in the job will not be able to see or otherwise interfere with the contents of the job's &lt;b&gt;/tmp&lt;/b&gt; unless it can write into &lt;b&gt;$_CONDOR_SCRATCH_DIR&lt;/b&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Condor refers to this feature as &lt;i&gt;MOUNT_UNDER_SCRATCH&lt;/i&gt;, and it will be a part of the 7.7.5 release. &amp;nbsp;It will take an admin-specified list of directories on the worker node. &amp;nbsp;With it, the job will have a private copy of these directories, which will be backed by &lt;b&gt;$_CONDOR_SCRATCH_DIR&lt;/b&gt;. &amp;nbsp;The contents - and size - of these will be managed by Condor, just like anything else in the scratch directory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If user separation is unavailable or not considered sufficient (if there are, for example, group accounts), an additional layer of isolation is needed to protect the job sandbox. 
&amp;nbsp;A topic for a future day!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-5734694593802146494?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/5734694593802146494/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2012/02/file-isolation-using-bind-mounts-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/5734694593802146494'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/5734694593802146494'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2012/02/file-isolation-using-bind-mounts-and.html' title='File Isolation using bind mounts and chroots'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-864451098144380179</id><published>2012-02-14T09:46:00.000-08:00</published><updated>2012-02-14T09:46:01.864-08:00</updated><title type='text'>Job Isolation in Condor</title><content type='html'>I'd like to share a few exciting new features under construction for Condor 7.7.6 (or 7.9.0, as it may be).&lt;br /&gt;&lt;br /&gt;I've been working hard to improve the&lt;i&gt;&amp;nbsp;job isolation&lt;/i&gt;&amp;nbsp;techniques available in Condor. 
&amp;nbsp;My dictionary defines the verb "to isolate" as "to be or remain alone or apart from others"; when applied to the Condor context, we'd like to isolate each job from the others. &amp;nbsp;We'll define &lt;i&gt;process isolation&lt;/i&gt;&amp;nbsp;as&amp;nbsp;the inability of a process running in a batch job to interfere with a process not a part of the job. &amp;nbsp;Interfering with processes on Linux, loosely defined, means the sending of POSIX signals, taking control via the ptrace mechanism, or writing into the other process's memory.&lt;br /&gt;&lt;br /&gt;Process isolation is only one aspect of job isolation. &amp;nbsp;Job isolation also includes the inability to interfere with other jobs' files (&lt;i&gt;file isolation&lt;/i&gt;) and not being able to consume others' system resources such as CPU, memory, or disk (&lt;i&gt;resource isolation&lt;/i&gt;).&lt;br /&gt;&lt;br /&gt;In Condor, process isolation has historically been accomplished via one of two mechanisms:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Submitting user&lt;/b&gt;. &amp;nbsp;Jobs from Alice and Bob will be submitted as the unix users alice and bob, respectively. &amp;nbsp;In this model, the jobs running on the worker node will be run as users alice and bob, respectively. &amp;nbsp;The processes in the job running under user bob are protected from the processes in the job running as user alice via traditional POSIX security mechanisms.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;This model makes the assumption that jobs submitted by the same user do not need isolation from each other. &amp;nbsp;In other words, there shouldn't be any shared user accounts!&lt;/li&gt;&lt;li&gt;This model also assumes the submit host and the worker node share a common user namespace. &amp;nbsp;This can be more difficult to accomplish than it sounds: if the submit host has thousands of unique users, we must make sure each functions on the worker node. 
&amp;nbsp;If the submit host is on a remote site with a different user namespace from the worker node, this may not be easily achievable!&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Per-slot users&lt;/b&gt;. &amp;nbsp;Each "slot" (roughly corresponding to a CPU) in Condor is assigned a unique Unix user. &amp;nbsp;The job currently running in that slot is run under the associated username.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;This solves the "gotchas" noted above with the submitting user isolation model.&lt;/li&gt;&lt;li&gt;This is difficult to accomplish in practice if the job wants to utilize a filesystem shared between the submit and worker nodes. &amp;nbsp;The filesystem security is based on two users having distinct Unix user names; in this model, there's no way to mark your files as only readable by your own jobs.&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;Notice both techniques rely on user isolation to accomplish process isolation. &amp;nbsp;Condor has an oft-overlooked &lt;a href="http://research.cs.wisc.edu/condor/manual/v7.7/3_6Security.html#SECTION004613200000000000000"&gt;third mode&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Mapping remote users to nobody&lt;/b&gt;. 
&amp;nbsp;In this mode, local users (where the site admin can define the meaning of "local") get mapped to the submit host usernames, but non-local users all get mapped to user &lt;i&gt;nobody&lt;/i&gt; - the traditional unprivileged user on Linux.&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Local users can access all their files, but remote users only get access to the batch resources - no shared file systems.&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;div&gt;Unfortunately, this is not a very secure mode as, according to the manual, the &lt;i&gt;nobody&lt;/i&gt; account&amp;nbsp;"...&lt;span style="background-color: white;"&gt;&amp;nbsp;may also be used by other Condor jobs running on the same machine, if it is a multi-processor machine"; not very handy advice in an age where your cell phone likely is a multi-processor machine!&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="background-color: white;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;This third mode is particularly attractive to us - we can avoid filesystem issues for our local users, but no longer have to create the thousands of accounts in our LDAP database for remote users. &amp;nbsp;However, since jobs from remote users run under the same unix user account, the traditional security mechanism of user separation does not apply - we need a new technique!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Enter &lt;a href="http://lwn.net/Articles/259217/" style="font-weight: bold;"&gt;PID namespaces&lt;/a&gt;, a new separation technique introduced in kernel 2.6.24. &amp;nbsp;By passing an &lt;a href="http://www.kernel.org/doc/man-pages/online/pages/man2/clone.2.html"&gt;additional flag&lt;/a&gt; when creating a new process, the kernel will assign an additional process ID (PID) to the child process. &amp;nbsp;The child will believe itself to be PID 1 (that is, when the child calls getpid(), it returns 1), while the processes in the parent's namespace will see a different PID. 
&amp;nbsp;The child will be able to spawn additional processes - all will be stuck in the same inner namespace - that similarly have an inner PID different from the outer one.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Processes within the namespace can only see and interfere (send signals, ptrace, etc) with other processes inside the namespace. &amp;nbsp;By launching the new job in its own PID namespace, Condor can achieve process isolation without user isolation: the job processes are isolated from all other processes on the system.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Perhaps the best way to visualize the impact of PID namespaces in the job is to examine the output of &lt;b&gt;ps&lt;/b&gt;:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;pre&gt;[bbockelm@localhost condor]$ condor_run ps faux&lt;br /&gt;USER &amp;nbsp; &amp;nbsp; &amp;nbsp; PID %CPU %MEM &amp;nbsp; &amp;nbsp;VSZ &amp;nbsp; RSS TTY &amp;nbsp; &amp;nbsp; &amp;nbsp;STAT START &amp;nbsp; TIME COMMAND&lt;br /&gt;bbockelm &amp;nbsp; &amp;nbsp; 1 &amp;nbsp;0.0 &amp;nbsp;0.0 114132 &amp;nbsp;1236 ? &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;SNs &amp;nbsp;11:42 &amp;nbsp; 0:00 /bin/bash /home/bbockelm/.condor_run.3672&lt;br /&gt;bbockelm &amp;nbsp; &amp;nbsp; 2 &amp;nbsp;0.0 &amp;nbsp;0.0 115660 &amp;nbsp;1080 ? &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;RN &amp;nbsp; 11:42 &amp;nbsp; 0:00 ps faux&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;Only two processes can be seen from within the job - the shell executing the job script and "ps" itself.&lt;br /&gt;&lt;br /&gt;Releasing a PID namespaces-enabled Condor is an ongoing effort:&amp;nbsp;&lt;a href="https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1959"&gt;https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1959&lt;/a&gt;; I've recently re-designed the patch to be far less intrusive on the Condor internals by switching from the glibc clone() call to the clone syscall. 
&amp;nbsp;I am hopeful it will make it in the 7.7.6 / 7.9.0 timescale.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;From a process isolation point-of-view, with this patch, it now is safe to run jobs as user "nobody" or re-introduce the idea of shared "group accounts". &amp;nbsp;For example, we could map all CMS users to a single "cmsuser" account without having to worry about these becoming a vector for virus infection.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, the story of &lt;i&gt;job isolation&lt;/i&gt;&amp;nbsp;does not end with PID namespaces. &amp;nbsp;Stay tuned to find out how we are tackling file and resource isolation!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-864451098144380179?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/864451098144380179/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2012/02/job-isolation-in-condor.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/864451098144380179'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/864451098144380179'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2012/02/job-isolation-in-condor.html' title='Job Isolation in Condor'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' 
src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-1284219792325627410</id><published>2012-01-27T10:29:00.000-08:00</published><updated>2012-01-27T10:36:26.796-08:00</updated><title type='text'>openstack - update</title><content type='html'>Last time I was able to deploy an image. The next step would be to list it and then run it. But I have hit problems.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;To list images I run the command:&lt;br /&gt;&lt;br /&gt; euca-describe-images&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;which hangs for a long time and then exits with the message "connection reset by peer".&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I have disabled iptables to eliminate firewall issues. That did not help.&lt;br /&gt;&lt;br /&gt;All manuals assume that euca-describe-images should simply run and do not give instructions on what to do if it does not.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Following Josh's advice I did:&lt;br /&gt;&lt;br /&gt; strace -o edi_output -f -ff euca-describe-images&lt;br /&gt;&lt;br /&gt;and then I looked into the output files. It seems that there might be two problems:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Some euca2ools files are missing - in particular the .eucarc configuration file.&lt;/li&gt;&lt;li&gt;There are messages about missing python files, for example "open("/usr/lib64/python2.6/site-packages/gtk-2.0/org.so", O_RDONLY) = -1 ENOENT (No such file or directory)" (There are many more like that).&lt;/li&gt;&lt;/ol&gt;So it seems that the euca2ools installation described in previous posts may not be complete and may be missing some key files. Or python (which we already know had to be patched) is not OK. 
Or both.&lt;br /&gt;&lt;br /&gt;That's all I know for now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-1284219792325627410?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/1284219792325627410/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2012/01/openstack-update.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/1284219792325627410'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/1284219792325627410'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2012/01/openstack-update.html' title='openstack - update'/><author><name>TomW</name><uri>http://www.blogger.com/profile/17076069319144757304</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-6608486540631647769</id><published>2012-01-10T08:47:00.000-08:00</published><updated>2012-01-10T08:53:04.117-08:00</updated><title type='text'>How to register an image in openstack</title><content type='html'>After having installed and configured the worker and controller nodes of the openstack testbed we would like to upload images into it.&lt;br /&gt;&lt;br /&gt;First I downloaded some images to /root/images on controller node.     One is from Xin and another one is a minimal image for testing I got     from the net. 
I have no idea what they are worth.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;   Then I tried to follow the instructions&lt;br /&gt; &lt;br /&gt;&lt;a class="moz-txt-link-freetext" href="http://docs.openstack.org/cactus/openstack-compute/admin/content/part-ii-getting-virtual-machines.html"&gt;http://docs.openstack.org/cactus/openstack-compute/admin/content/part-ii-getting-virtual-machines.html&lt;/a&gt;&lt;br /&gt; &lt;br /&gt;   which go like this:&lt;br /&gt;  &lt;br /&gt;    &lt;pre style="font-family: courier new;" class="literallayout"&gt;&lt;a id="d1542e1756"&gt;image="ubuntu1010-UEC-localuser-image.tar.gz"&lt;br /&gt;wget http://c0179148.cdn1.cloudfiles.rackspacecloud.com/ubuntu1010-UEC-localuser-image.tar.gz&lt;br /&gt;uec-publish-tarball $image [bucket-name] [hardware-arch]&lt;/a&gt;&lt;/pre&gt;   &lt;br /&gt; &lt;br /&gt;   and I could not find where the&lt;br /&gt;   &lt;pre class="literallayout"&gt;&lt;a id="d1542e1756"&gt;uec-publish-tarball&lt;/a&gt;&lt;/pre&gt;   &lt;br /&gt;   command comes from. 
Finally I realized that it comes from Ubuntu and the manual became Ubuntu-specific without saying so explicitly.&lt;br /&gt; &lt;br /&gt; &lt;br /&gt;   So I tried a different approach.&lt;br /&gt;&lt;br /&gt;   &lt;span style="font-family:courier new;"&gt;cd /root/images&lt;/span&gt;&lt;br /&gt;   &lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     glance add name="My Image" &amp;lt; sl61-kvm.tar.bz2 # the image I got from Xin&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;   The command responded that the image got Id=1, which is a good sign.&lt;br /&gt; &lt;br /&gt;   Then I did:&lt;br /&gt; &lt;br /&gt;   &lt;span style="font-family:courier new;"&gt;glance show 1&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;   and got:&lt;br /&gt; &lt;br /&gt;   &lt;span style="font-family:courier new;"&gt;URI: &lt;/span&gt;&lt;a style="font-family: courier new;" class="moz-txt-link-freetext" href="http://0.0.0.0/images/1"&gt;http://0.0.0.0/images/1&lt;/a&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     Id: 1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     Public: No&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     Name: My Image&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     Size: 199737477&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     Location: &lt;/span&gt;&lt;a style="font-family: courier new;" class="moz-txt-link-freetext"&gt;file:///var/lib/glance/images/1&lt;/a&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     Disk format: raw&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     Container format: ovf&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;   Which suggests that the file is in the system. 
But when I tried:&lt;br /&gt; &lt;br /&gt;   &lt;span style="font-family:courier new;"&gt;glance index&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;   it said:&lt;br /&gt; &lt;br /&gt;   &lt;span style="font-family:courier new;"&gt;no public images found&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;   So I tried to register it again:&lt;br /&gt; &lt;br /&gt;   &lt;span style="font-family:courier new;"&gt; glance add name="My Image" is_public=true  &amp;lt; sl61-kvm.tar.bz2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     Added new image with ID: 2&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;   I tried to list:&lt;br /&gt; &lt;br /&gt;   &lt;span style="font-family:courier new;"&gt;glance index&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     Found 1 public images...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     ID               Name                           Disk Format              Container Format     Size          &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     ---------------- ------------------------------ --------------------     -------------------- --------------&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;     2                My Image                       raw                      ovf                       199737477&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;   So it seems we have uploaded an image to the system.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Now I have to figure out how to run it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-6608486540631647769?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/6608486540631647769/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://osgtech.blogspot.com/2012/01/how-to-register-image-in-openstack.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/6608486540631647769'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/6608486540631647769'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2012/01/how-to-register-image-in-openstack.html' title='How to register an image in openstack'/><author><name>TomW</name><uri>http://www.blogger.com/profile/17076069319144757304</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-3018228315676012597</id><published>2012-01-06T07:17:00.000-08:00</published><updated>2012-01-06T09:37:49.106-08:00</updated><title type='text'>How to configure worker node - part 2</title><content type='html'>&lt;pre class="literallayout"&gt;&lt;a id="d1542e564"&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:180%;"&gt;&lt;span style="font-weight: bold;"&gt;Compute node configuration - continued&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We execute the following commands:&lt;br /&gt;&lt;br /&gt;This command is supposed to synchronize the database:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;/usr/bin/nova-manage db sync &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-family:times new roman;"&gt;Now we have to create users and projects. 
We call both users and projects "nova"&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;/usr/bin/nova-manage user admin nova&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;/usr/bin/nova-manage project create nova nova &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;/usr/bin/nova-manage network create 192.168.0.0/24 1 256&lt;br /&gt;&lt;br /&gt;We check that users and projects were created correctly:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;/usr/bin/nova-manage project list&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;nova&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;/usr/bin/nova-manage user list&lt;br /&gt;nova&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;&lt;span style="font-weight: bold;"&gt;Create Certifications&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:85%;"&gt;On the controller node execute&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;a id="d1542e564"&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="font-size:180%;"&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;a id="d1542e575"&gt;mkdir -p /root/creds&lt;br /&gt;&lt;br /&gt;/usr/bin/python /usr/bin/nova-manage project zipfile nova nova /root/creds/novacreds.zip&lt;/a&gt;&lt;a id="d1542e564"&gt;&lt;br /&gt;&lt;br /&gt;If you encounter a python error, then apply the python patch described a few posts earlier.&lt;br /&gt;&lt;br /&gt;Create /root/creds on the compute node and copy the&lt;br /&gt;&lt;/a&gt;&lt;a id="d1542e575"&gt;novacreds.zip&lt;/a&gt;&lt;a id="d1542e564"&gt; file there. 
Then unpack it&lt;br /&gt;&lt;br /&gt;&lt;/a&gt;&lt;a id="d1542e575"&gt;unzip /root/creds/novacreds.zip -d /root/creds/&lt;br /&gt;&lt;br /&gt;A few files will appear, among them&lt;br /&gt;&lt;/a&gt;&lt;a id="d1542e575"&gt;/root/creds/novarc . This file needs to be appended to .bashrc, but there is a catch:&lt;br /&gt;first line of the file has an error and has to be replaced:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Original line:&lt;br /&gt;&lt;br /&gt;NOVA_KEY_DIR=$(pushd $(dirname $BASH_SOURCE)&amp;gt;/dev/null; pwd; popd&amp;gt;/dev/null)&lt;br /&gt;&lt;br /&gt;has to be replaced with&lt;br /&gt;&lt;br /&gt;NOVA_KEY_DIR=~/creds&lt;br /&gt;&lt;br /&gt;The content of novarc file now is&lt;br /&gt;&lt;br /&gt;NOVA_KEY_DIR=~/creds&lt;br /&gt;&lt;br /&gt;export EC2_ACCESS_KEY="XXXXXXXXXXXXXXXXXXXXXXXX:nova"&lt;br /&gt;export EC2_SECRET_KEY="XXXXXXXXXXXXXXXXXXXXXXXX"&lt;br /&gt;export EC2_URL="http://130.199.148.53:8773/services/Cloud"&lt;br /&gt;export S3_URL="http://130.199.148.53:3333"&lt;br /&gt;export EC2_USER_ID=42 # nova does not use user id, but bundling requires it&lt;br /&gt;export EC2_PRIVATE_KEY=${NOVA_KEY_DIR}/pk.pem&lt;br /&gt;export EC2_CERT=${NOVA_KEY_DIR}/cert.pem&lt;br /&gt;export NOVA_CERT=${NOVA_KEY_DIR}/cacert.pem&lt;br /&gt;export EUCALYPTUS_CERT=${NOVA_CERT} # euca-bundle-image seems to require this set&lt;br /&gt;alias ec2-bundle-image="ec2-bundle-image --cert ${EC2_CERT} --privatekey ${EC2_PRIVATE_KEY} --user 42 --ec2cert ${NOVA_CERT}"&lt;br /&gt;alias ec2-upload-bundle="ec2-upload-bundle -a ${EC2_ACCESS_KEY} -s ${EC2_SECRET_KEY} --url ${S3_URL} --ec2cert ${NOVA_CERT}"&lt;br /&gt;export NOVA_API_KEY="XXXXXXXXXXXXXXXXXXXXXXXXXXX"&lt;br /&gt;export NOVA_USERNAME="nova"&lt;br /&gt;export NOVA_URL="http://130.199.148.53:8774/v1.0/"&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Where "XXXX.." 
strings denote keys which I do not post here, for security.&lt;br /&gt;&lt;br /&gt;The content of the novarc file should now be appended to .bashrc:&lt;br /&gt;&lt;br /&gt;&lt;/a&gt;&lt;a id="d1542e575"&gt;cat /root/creds/novarc &amp;gt;&amp;gt; ~/.bashrc&lt;br /&gt;source ~/.bashrc&lt;br /&gt;&lt;br /&gt;This should be done on both the compute and controller nodes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Enable access to worker node&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;First unset a proxy and then do:&lt;br /&gt;&lt;br /&gt;&lt;a id="d1542e583"&gt;euca-authorize -P icmp -t -1:-1 default&lt;br /&gt;euca-authorize -P tcp -p 22 default&lt;/a&gt;&lt;a id="d1542e564"&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/a&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-3018228315676012597?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/3018228315676012597/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2012/01/how-to-configure-worker-node-part-2.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/3018228315676012597'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/3018228315676012597'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2012/01/how-to-configure-worker-node-part-2.html' title='How to configure worker node - part 2'/><author><name>TomW</name><uri>http://www.blogger.com/profile/17076069319144757304</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-6934297547717816467</id><published>2012-01-05T12:49:00.000-08:00</published><updated>2012-01-05T13:09:38.759-08:00</updated><title type='text'>How to configure worker node</title><content type='html'>In the following I will describe how to configure the worker node. I assume that the worker node has already been installed following the instructions posted on this blog.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;First of all, before we start, we still need to add nova-network (it has not been installed so far).&lt;br /&gt;&lt;br /&gt;Do:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;yum install openstack-nova-network&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Once this is done, we can go on and edit the &lt;span style="font-family: courier new;"&gt;/etc/nova/nova.conf&lt;/span&gt; file.&lt;br /&gt;&lt;br /&gt;First, add to the file the option&lt;br /&gt;&lt;br /&gt;&lt;pre class="literallayout"&gt;&lt;a id="d1542e508"&gt;--daemonize=1 &lt;/a&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The relevant switches are:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--sql_connection&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--s3_host&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--rabbit_host&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--ec2_api&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--ec2_url&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--fixed_range&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--network_size&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In the end the configuration file should look like:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--auth_driver=nova.auth.dbdriver.DbDriver&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-family: courier new;"&gt;--buckets_path=/var/lib/nova/buckets&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--ca_path=/var/lib/nova/CA&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--cc_host=&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--credentials_template=/usr/share/nova/novarc.template&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--daemonize=1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--dhcpbridge_flagfile=/etc/nova/nova.conf&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--dhcpbridge=/usr/bin/nova-dhcpbridge&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--ec2_api=130.199.148.53&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--ec2_url=http://130.199.148.53:8773/services/Cloud&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--fixed_range=192.168.0.0/16&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--glance_host=&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--glance_port=9292&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--image_service=nova.image.glance.GlanceImageService&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--images_path=/var/lib/nova/images&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--injected_network_template=/usr/share/nova/interfaces.rhel.template&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--instances_path=/var/lib/nova/instances&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--keys_path=/var/lib/nova/keys&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--libvirt_type=kvm&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--libvirt_xml_template=/usr/share/nova/libvirt.xml.template&lt;/span&gt;&lt;br 
/&gt;&lt;span style="font-family: courier new;"&gt;--lock_path=/var/lib/nova/tmp&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--logdir=/var/log/nova&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--logging_context_format_string=%(asctime)s %(name)s: %(levelname)s [%(request_id)s %(user)s %(project)s] %(message)s&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--logging_debug_format_suffix=&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--logging_default_format_string=%(asctime)s %(name)s: %(message)s&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--network_manager=nova.network.manager.VlanManager&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--networks_path=/var/lib/nova/networks&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--network_size=8&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--node_availability_zone=nova&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--rabbit_host=130.199.148.53&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--routing_source_ip=130.199.148.53&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--s3_host=130.199.148.53&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--scheduler_driver=nova.scheduler.zone.ZoneScheduler&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--sql_connection=mysql://{USER}:{PWD}@130.199.148.53/{DATABASE}&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--state_path=/var/lib/nova&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--use_cow_images=true&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--use_ipv6=false&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--use_s3=true&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier 
new;"&gt;--use_syslog=false&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--verbose=false&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;--vpn_client_template=/usr/share/nova/client.ovpn.template&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;where {USER}, {PWD} and {DATABASE} denote the nova database user, password and database name.&lt;br /&gt;&lt;br /&gt;Now go to the controller node and open the following ports for incoming connections: 3333,3306,5672,8773,8000.&lt;br /&gt;&lt;br /&gt;Go back to the worker node and prepare the &lt;span style="font-family: courier new;"&gt;/root/bin/openstack-init.sh&lt;/span&gt; script with the following content:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;#!/bin/bash&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;for n in ajax-console-proxy compute vncproxy network; do&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;    service openstack-nova-$n "$@";&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;done&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Then run&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;/root/bin/openstack-init.sh stop&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Stopping OpenStack Nova Web-based serial console proxy:    [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Stopping OpenStack Nova Compute Worker:                    [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Stopping OpenStack Nova VNC Proxy:                         [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Stopping OpenStack Nova Network Controller:                [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;[root@gridreserve30 compute]# /root/bin/openstack-init.sh start&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova 
Web-based serial console proxy:    [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova Compute Worker:                    [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova VNC Proxy:                         [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova Network Controller:                [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;to be continued...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-6934297547717816467?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/6934297547717816467/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2012/01/how-to-configure-worker-node.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/6934297547717816467'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/6934297547717816467'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2012/01/how-to-configure-worker-node.html' title='How to configure worker node'/><author><name>TomW</name><uri>http://www.blogger.com/profile/17076069319144757304</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-2894650495289033883</id><published>2011-12-29T17:12:00.000-08:00</published><updated>2011-12-29T17:12:19.928-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='system 
administration'/><category scheme='http://www.blogger.com/atom/ns#' term='hold'/><category scheme='http://www.blogger.com/atom/ns#' term='Condor'/><title type='text'>What's the hold-up?</title><content type='html'>Do you have the following diagram memorized?&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-HXJdhJQRwaU/Tv0HaKXnHsI/AAAAAAAAAeo/36XqVjRgO2I/s1600/condor_startd_policy_states.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="385" src="http://1.bp.blogspot.com/-HXJdhJQRwaU/Tv0HaKXnHsI/AAAAAAAAAeo/36XqVjRgO2I/s400/condor_startd_policy_states.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;If your site runs Condor, you probably should. &amp;nbsp;It shows the states of the &lt;b&gt;condor_startd&lt;/b&gt;, the activities within the state, and the transitions between them. &amp;nbsp;If you want to have jobs reliably pre-empted (or is that killed? &amp;nbsp;Or vacated?) from the worker node for something like memory usage, a clear understanding is required.&lt;br /&gt;&lt;br /&gt;However, the 30 state transitions might be a bit much for some site admins who just want to kill jobs that go over a memory limit. &amp;nbsp;In such a case, admins can utilize the &lt;i&gt;SYSTEM_PERIODIC_REMOVE&lt;/i&gt; or the &lt;i&gt;SYSTEM_PERIODIC_HOLD&lt;/i&gt; configuration parameters on the &lt;b&gt;condor_schedd&lt;/b&gt; to respectively remove or hold jobs.&lt;br /&gt;&lt;br /&gt;These expressions periodically evaluate the schedd's copy of the job ClassAd (by default, once every 60s); if they evaluate to true for a given job, they will remove or hold it. &amp;nbsp;This will almost immediately preempt execution on the worker node.&lt;br /&gt;&lt;br /&gt;[Note: While effective and simple, these are &lt;i&gt;not&lt;/i&gt;&amp;nbsp;the best way to accomplish these sort of policies! 
&amp;nbsp;As the worker node may talk to multiple &lt;b&gt;schedd&lt;/b&gt;'s (via flocking, or just through a complex pool with many schedd's), it's best to express the node's preferences locally.]&lt;br /&gt;&lt;br /&gt;At HCC, the periodic hold and remove policy looks like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;# hold jobs using absurd amounts of disk (100+ GB) or memory (1.6+ GB)&lt;br /&gt;SYSTEM_PERIODIC_HOLD = \&lt;br /&gt;&amp;nbsp; &amp;nbsp;(JobStatus == 1 || JobStatus == 2) &amp;amp;&amp;amp; ((DiskUsage &amp;gt; 100000000 || ResidentSetSize &amp;gt; 1600000))&lt;br /&gt;&lt;br /&gt;# forceful removal of running jobs after 2 days, held jobs after 6 hours,&lt;br /&gt;# and anything trying to run more than 10 times&lt;br /&gt;SYSTEM_PERIODIC_REMOVE = \&lt;br /&gt;   (JobStatus == 5 &amp;amp;&amp;amp; CurrentTime - EnteredCurrentStatus &amp;gt; 3600*6) || \&lt;br /&gt;   (JobStatus == 2 &amp;amp;&amp;amp; CurrentTime - EnteredCurrentStatus &amp;gt; 3600*24*2) || \&lt;br /&gt;   (JobStatus == 5 &amp;amp;&amp;amp; JobRunCount &amp;gt;= 10) || \&lt;br /&gt;   (JobStatus == 5 &amp;amp;&amp;amp; HoldReasonCode =?= 14 &amp;amp;&amp;amp; HoldReasonSubCode =?= 2)&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;We place anything on hold that goes over some pre-defined resource limit (disk usage or memory usage). &amp;nbsp;Jobs are removed if they have been on hold for a long time, have run for too long, have restarted too many times, or are missing their input files.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that this is a flat policy for the cluster - heterogeneous nodes with large amounts of RAM per core would not be well-utilized. &amp;nbsp;We could tweak this by having users add the &lt;i&gt;RequestMemory&lt;/i&gt; attribute to their job's ad (defaulting to 1.6GB), place into the &lt;i&gt;Requirements&lt;/i&gt; that the slot have sufficient memory, and have the node only accept jobs that request memory below a certain threshold. 
&amp;nbsp;The expression above could then be tweaked to hold jobs where &lt;i&gt;(ResidentSetSize &amp;gt; RequestMemory * 1024)&lt;/i&gt;; the factor is needed because &lt;i&gt;RequestMemory&lt;/i&gt; is given in MB while &lt;i&gt;ResidentSetSize&lt;/i&gt; is reported in KB. &amp;nbsp;Perhaps more on that in the future if we go this route.&lt;br /&gt;&lt;br /&gt;While the &lt;i&gt;SYSTEM_PERIODIC_*&lt;/i&gt; expressions are useful, Dan Bradley recently introduced me to the &lt;a href="https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2185"&gt;&lt;i&gt;SYSTEM_PERIODIC_*_REASON&lt;/i&gt; parameter.&lt;/a&gt; &amp;nbsp;This allows you to build a custom hold message for the user whose jobs you're about to interrupt. &amp;nbsp;The expression is evaluated within the context of the job's ad, and the resulting string is placed in the job's &lt;i&gt;HoldReason&lt;/i&gt; attribute. &amp;nbsp;As an example, previously, the hold message was something bland and generic:&lt;br /&gt;&lt;br /&gt;The SYSTEM_PERIODIC_HOLD expression evaluated to true.&lt;br /&gt;&lt;br /&gt;Why did it evaluate to true? &amp;nbsp;Was it memory or disk usage? &amp;nbsp;When it was held, how bad was the disk/memory usage? &amp;nbsp;These things can get lost in the system. &amp;nbsp;&lt;a href="https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2725"&gt;Oops&lt;/a&gt;. 
&amp;nbsp;We added the following to our schedd's configuration:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;# Report why the job went on hold.&lt;br /&gt;SYSTEM_PERIODIC_HOLD_REASON = \&lt;br /&gt;&amp;nbsp; &amp;nbsp;strcat("Job in status ", JobStatus, \&lt;br /&gt;&amp;nbsp; &amp;nbsp;" put on hold by SYSTEM_PERIODIC_HOLD due to ", \&lt;br /&gt;&amp;nbsp; &amp;nbsp;ifThenElse(isUndefined(DiskUsage) || DiskUsage &amp;lt;= 100000000, \&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; strcat("memory usage ", ResidentSetSize), \&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; strcat("disk usage ", DiskUsage)), ".")&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, we have beautiful error messages in the user's logs explaining the issue:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;Job in status 2 put on hold by SYSTEM_PERIODIC_HOLD due to memory usage 1620340.&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;One less thing to get confused about!&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-2894650495289033883?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/2894650495289033883/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/12/whats-hold-up.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/2894650495289033883'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/2894650495289033883'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/12/whats-hold-up.html' title='What&apos;s the hold-up?'/><author><name>Brian 
Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-HXJdhJQRwaU/Tv0HaKXnHsI/AAAAAAAAAeo/36XqVjRgO2I/s72-c/condor_startd_policy_states.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-4646684131624655729</id><published>2011-12-23T06:14:00.000-08:00</published><updated>2011-12-23T06:14:49.840-08:00</updated><title type='text'>A simple iRODS Micro-Service</title><content type='html'>&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;span style="font-size: large;"&gt;Introduction&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The goal I had for this task was to identify and understand the steps and configurations involved in writing a micro-service and seeing it in action - for details regarding iRODS please refer to documentation at &lt;a href="https://www.irods.org/"&gt;https://www.iRODS.org/&lt;/a&gt;. The micro-service that I wrote is very simplistic (it writes a hello world message to the system log), however it serves its purpose by providing an overview of steps that will be involved in writing a useful micro-service.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Before I document the configurations and codes involved in creating and registering the new micro-service let’s look at figure 1. 
&lt;br /&gt;&lt;br /&gt;&lt;img height="176px;" src="https://lh3.googleusercontent.com/7ug-Iwe3O_50aQrfrC46oo6ujLIIeDOIULiu_yeVMsDwtycKuXswtB5fFCeFWPZtTkgCGAtkUSDtRNLdzJJH-MsrvzCBjANMvl6Fre4xHJioC38ajSw" width="576px;" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Figure 1 shows a high level view of&amp;nbsp; invocation of a micro-service by the iRODS rules engine. One way of looking at the micro-service and the iRODS rule engine is to think of it as an event based triggering system that can perform ‘operations’ on the data objects, and/or external resources. The micro-services are registered in iRODS rule definitions and the rule engine invokes them based on the condition specified for that rule. For a list of places in the iRODS workflow where a micro-service may be triggered please visit: &lt;a href="https://www.irods.org/index.php/Default_iRODS_Rules"&gt;https://www.irods.org/index.php/Default_iRODS_Rules&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Also you may refer to &lt;a href="https://www.irods.org/index.php/Rule_Engine"&gt;https://www.iRODS.org/index.php/Rule_Engine&lt;/a&gt; for a detailed diagram of a micro-service invocation.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;img height="242px;" src="https://lh5.googleusercontent.com/uf3vI967QtzS1obJj-R3PYfL6KRUYs5O4P_1iISdEYXXzwnPxFdT9o--8j__edpPSJxxYeOeNj7DxreQBM1HXB8O27ZmD26zq_-7iPRJpBdGlVL7zMo" width="576px;" /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;Figure 2 above shows the communication between the iRODS rule engine and a micro-service. A simplistic view of the communication layers is that the rule engine calls a defined C procedure, which exposes its functionality through an interface (commonly prefixed with msi). 
The arguments to the procedure are passed through a structure named &lt;i&gt;msParam_t&lt;/i&gt; that is defined below:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;typedef struct MsParam {&lt;br /&gt;&amp;nbsp; char *label;&lt;br /&gt;&amp;nbsp; char *type;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; /* this is the name of the packing instruction in&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; * rodsPackTable.h */&lt;br /&gt;&amp;nbsp; void *inOutStruct;&lt;br /&gt;&amp;nbsp; bytesBuf_t *inpOutBuf;&lt;br /&gt;} msParam_t;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span style="font-size: large;"&gt;Writing the micro-service&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Figure 3 shows the steps involved in creating a new micro-service:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;img height="129px;" src="https://lh4.googleusercontent.com/W6ZpoyTbAvhsPXnwI2_bJg7hgdTF3eOkL1tWN-aF7Cl10NidSpM8n2oKKOifxZhX5bruPK-IZHOSQOe525sMJEgkjP5yQacPF1tThetlEiRy9K4pjOM" width="576px;" /&gt;&lt;/div&gt;&lt;b&gt;Write the C procedure&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The C code below (lets call it test.c) has a function writemessage that writes a message to the system log. There is an interface to the function named msiWritemessage which exposes the writemessage function. The msi function takes a list of arguments of type msParam_t and a last argument of type ruleExecInfo_t for the result of the operation. 
&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;#include &amp;lt;unistd.h&amp;gt;&lt;br /&gt;#include &amp;lt;syslog.h&amp;gt;&lt;br /&gt;#include &amp;lt;string.h&amp;gt;&lt;br /&gt;#include "apiHeaderAll.h"&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;void writemessage(char arg1[], char arg2[]);&lt;br /&gt;int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,&amp;nbsp; ruleExecInfo_t *rei);&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;void writemessage(char arg1[], char arg2[]) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; openlog("slog", LOG_PID|LOG_CONS, LOG_USER);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; syslog(LOG_INFO, "%s %s from micro-service", arg1, arg2);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; closelog();&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,&amp;nbsp; ruleExecInfo_t *rei)&lt;br /&gt;{&lt;br /&gt;&amp;nbsp;/* default to empty strings so an argument of an unexpected type is never logged uninitialized */&lt;br /&gt;&amp;nbsp;char *in1 = "";&lt;br /&gt;&amp;nbsp;char *in2 = "";&lt;br /&gt;&amp;nbsp;RE_TEST_MACRO ("&amp;nbsp;&amp;nbsp;&amp;nbsp; Calling Procedure");&lt;br /&gt;&amp;nbsp;// the above line is needed for loop back testing using irule -i option&lt;br /&gt;&amp;nbsp;if ( strcmp( mParg1-&amp;gt;type, STR_MS_T ) == 0 )&lt;br /&gt;&amp;nbsp;{&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; in1 = (char*) mParg1-&amp;gt;inOutStruct;&lt;br /&gt;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;/* both rule arguments are passed as strings */&lt;br /&gt;&amp;nbsp;if ( strcmp( mParg2-&amp;gt;type, STR_MS_T ) == 0 )&lt;br /&gt;&amp;nbsp;{&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; in2 = (char*) mParg2-&amp;gt;inOutStruct;&lt;br /&gt;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;writemessage(in1, in2);&lt;br /&gt;&amp;nbsp;return rei-&amp;gt;status;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Next I will make a folder structure in the &lt;i&gt;modules&lt;/i&gt; folder of the iRODS home for this micro-service, copy a few files from the example &lt;i&gt;properties&lt;/i&gt; module, and modify them to fit the test.c micro-service:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;cd ~irods&lt;br /&gt;mkdir 
modules/HCC&lt;br /&gt;cd modules/HCC&lt;br /&gt;&lt;br /&gt;mkdir microservices&lt;br /&gt;mkdir rules&lt;br /&gt;mkdir lib&lt;br /&gt;mkdir clients&lt;br /&gt;mkdir servers&lt;br /&gt;&lt;br /&gt;mkdir microservices/src&lt;br /&gt;mkdir microservices/include&lt;br /&gt;mkdir microservices/obj&lt;br /&gt;cp ../properties/Makefile .&lt;br /&gt;cp ../properties/info.txt .&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Listed below is my working copy of Makefile and the info.txt&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;#Makefile&lt;br /&gt;ifndef buildDir&lt;br /&gt;buildDir = $(CURDIR)/../..&lt;br /&gt;endif&lt;br /&gt;&lt;br /&gt;include $(buildDir)/config/config.mk&lt;br /&gt;include $(buildDir)/config/platform.mk&lt;br /&gt;include $(buildDir)/config/directories.mk&lt;br /&gt;include $(buildDir)/config/common.mk&lt;br /&gt;&lt;br /&gt;#&lt;br /&gt;# Directories&lt;br /&gt;#&lt;br /&gt;MSObjDir =&amp;nbsp;&amp;nbsp;&amp;nbsp; $(modulesDir)/HCC/microservices/obj&lt;br /&gt;MSSrcDir =&amp;nbsp;&amp;nbsp;&amp;nbsp; $(modulesDir)/HCC/microservices/src&lt;br /&gt;MSIncDir =&amp;nbsp;&amp;nbsp;&amp;nbsp; $(modulesDir)/HCC/microservices/include&lt;br /&gt;&lt;br /&gt;# Source files&lt;br /&gt;&lt;br /&gt;OBJECTS =&amp;nbsp;&amp;nbsp;&amp;nbsp; $(MSObjDir)/test.o&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# Compile and link flags&lt;br /&gt;#&lt;br /&gt;INCLUDES +=&amp;nbsp;&amp;nbsp;&amp;nbsp; $(INCLUDE_FLAGS) $(LIB_INCLUDES) $(SVR_INCLUDES)&lt;br /&gt;CFLAGS_OPTIONS := $(CFLAGS) $(MY_CFLAG)&lt;br /&gt;CFLAGS =&amp;nbsp;&amp;nbsp;&amp;nbsp; $(CFLAGS_OPTIONS) $(INCLUDES) $(MODULE_CFLAGS)&lt;br /&gt;&lt;br /&gt;.PHONY: all server client microservices clean&lt;br /&gt;.PHONY: server_ldflags client_ldflags server_cflags client_cflags&lt;br /&gt;.PHONY: print_cflags&lt;br /&gt;&lt;br /&gt;# Build everytying&lt;br /&gt;all:&amp;nbsp;&amp;nbsp;&amp;nbsp; microservices&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @true&lt;br /&gt;&lt;br /&gt;# List module's objects and needed libs for 
inclusion in clients&lt;br /&gt;client_ldflags:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @true&lt;br /&gt;&lt;br /&gt;# List module's includes for inclusion in the clients&lt;br /&gt;client_cflags:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @true&lt;br /&gt;&lt;br /&gt;# List module's objects and needed libs for inclusion in the server&lt;br /&gt;server_ldflags:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @echo $(OBJECTS) $(LIBS)&lt;br /&gt;&lt;br /&gt;# List module's includes for inclusion in the server&lt;br /&gt;server_cflags:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @echo $(INCLUDE_FLAGS)&lt;br /&gt;&lt;br /&gt;# Build microservices&lt;br /&gt;microservices:&amp;nbsp;&amp;nbsp;&amp;nbsp; print_cflags $(OBJECTS)&lt;br /&gt;&lt;br /&gt;# Build client additions&lt;br /&gt;client:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @true&lt;br /&gt;&lt;br /&gt;# Build server additions&lt;br /&gt;server:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @true&lt;br /&gt;&lt;br /&gt;# Build rules&lt;br /&gt;rules:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @true&lt;br /&gt;&lt;br /&gt;# Clean&lt;br /&gt;clean:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @echo "Clean image module..."&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; rm -rf $(MSObjDir)/*.o&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# Show compile flags&lt;br /&gt;print_cflags:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @echo "Compile flags:"&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @echo "&amp;nbsp;&amp;nbsp;&amp;nbsp; $(CFLAGS_OPTIONS)"&lt;br /&gt;&lt;br /&gt;# Compile targets&lt;br /&gt;#&lt;br /&gt;$(OBJECTS): $(MSObjDir)/%.o: $(MSSrcDir)/%.c $(DEPEND)&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @echo "Compile image module `basename $@`..."&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; @$(CC) -c $(CFLAGS) -o $@ $&amp;lt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;info.txt&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;Name:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; HCC&lt;br 
/&gt;Brief:&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; HCC Test microservice&lt;br /&gt;Description:&amp;nbsp;&amp;nbsp;&amp;nbsp; HCC Test microservice.&lt;br /&gt;Dependencies:&lt;br /&gt;Enabled:&amp;nbsp;&amp;nbsp;&amp;nbsp; yes&lt;br /&gt;Creator:&amp;nbsp;&amp;nbsp;&amp;nbsp; Ashu Guru&lt;br /&gt;Created:&amp;nbsp;&amp;nbsp;&amp;nbsp; December 2011&lt;br /&gt;License:&amp;nbsp;&amp;nbsp;&amp;nbsp; BSD&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;In the next step I will define the micro-service header and micro-service table files so that iRODS can be configured with the new micro-service. This is done in the folder microservices/include. In this example there is no header for this code, so I have left the header file blank; the micro-service table file holds the table entry.&amp;nbsp; The specifics to note below are that the first field is the label of the micro-service, the second is the count of input arguments of the msi interface (do not count the ruleExecInfo_t argument), and the third is the name of the msi interface function.&lt;br /&gt;&lt;br /&gt;File microservices/include/microservices.table&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;{ "msiWritemessage",2,(funcPtr) msiWritemessage },&amp;nbsp;&amp;nbsp; &lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Following is the directory tree structure for the HCC module that I have so far:&lt;br /&gt;&lt;pre&gt;bash-4.1$ pwd&amp;nbsp;&lt;/pre&gt;&lt;pre&gt;/opt/iRODS/modules&lt;br /&gt;bash-4.1$ tree HCC&lt;br /&gt;HCC&lt;br /&gt;├── clients&lt;br /&gt;├── info.txt&lt;br /&gt;├── lib&lt;br /&gt;├── Makefile&lt;br /&gt;├── microservices&lt;br /&gt;│&amp;nbsp;&amp;nbsp; ├── include&lt;br /&gt;│&amp;nbsp;&amp;nbsp; │&amp;nbsp;&amp;nbsp; ├── microservices.header&lt;br /&gt;│&amp;nbsp;&amp;nbsp; │&amp;nbsp;&amp;nbsp; ├── microservices.table&lt;br /&gt;│&amp;nbsp;&amp;nbsp; ├── obj&lt;br /&gt;│&amp;nbsp;&amp;nbsp; └── 
src&lt;br /&gt;│&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ├── test.c&lt;br /&gt;├── rules&lt;br /&gt;└── servers&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Next I will enable the new module (this micro-service) in the file &lt;i&gt;~irods/config/config.mk&lt;/i&gt; so that the iRODS Makefile includes it in the build. To do this, simply add the module folder name (in my case HCC) to the variable &lt;i&gt;MODULES&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span style="font-size: large;"&gt;Compile and test&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;cd ~irods/modules/&amp;lt;YOURMODULENAME&amp;gt;&lt;br /&gt;make&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The above commands should result in the creation of an object file in the microservices/obj folder. I am going to test the micro-service manually first; to accomplish this I will create a client-side rule file in the folder &lt;i&gt;~irods/clients/icommands/test/rules&lt;/i&gt;. I have named the file aguru.ir and following are the contents of the file:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;aguruTest||msiWritemessage(*A,*B)|nop&lt;br /&gt;*A=helloworld%*B=testing&lt;br /&gt;&lt;/pre&gt;&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The first line of the file is the rule definition and the second line holds the input parameters. 
To test the micro-service I will invoke it, which then writes a message to the system log (see figure below).&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;div style="text-align: justify;"&gt;&lt;img height="363px;" src="https://lh6.googleusercontent.com/PvAfaBLKo7o0OayethKj9p71V-a_sQA0rZHS4GBWk8VW_3gGK8dPi5g3Jp1f_0E5vZnCCU4XFv2-y1XA5MXaXaG6sNF2sOUbIVryR1jb6M41H0vGxWs" width="516px;" /&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;b&gt;&lt;span style="font-size: large;"&gt;Recompile iRODS&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Before this step I must make the entries for the headers and the msi table in the iRODS main micro-service action table (i.e. file ~irods/server/re/include/reAction.h). This should be done using the following commands:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;rm server/re/include/reAction.h&lt;br /&gt;make reaction&amp;nbsp;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;However, I had to manually add the declaration below to &lt;i&gt;server/re/include/reAction.h&lt;/i&gt; to accomplish that:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;int msiWritemessage(msParam_t *mParg1, msParam_t *mParg2,&amp;nbsp; ruleExecInfo_t *rei);&lt;br /&gt;&lt;/pre&gt;Finally, recompile iRODS:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;cd ~irods&lt;br /&gt;make test_flags&lt;br /&gt;make modules&lt;br /&gt;./irodsctl stop&lt;br /&gt;make clean&lt;br /&gt;make&lt;br /&gt;./irodsctl start&lt;br /&gt;./irodsctl status&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;&lt;span style="font-size: large;"&gt;Register Micro-service and Test&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;In this step we define a rule that will trigger the micro-service when a new data object is uploaded to iRODS. 
Open the file &lt;i&gt;~irods/server/config/reConfigs/core.re&lt;/i&gt; and add the following line to the Test Rules section.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;acPostProcForPut {msiWritemessage("HelloWorld","String 2"); }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;That is it… if I now put (iput) any file into iRODS, a message is added to the /var/log/messages file on the iRODS server. Please note that the above rule is not filtering a particular occurrence but is a catchall rule that applies to all put events.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;References:&lt;/b&gt;&lt;br /&gt;&lt;a href="https://www.irods.org/"&gt;https://www.irods.org/&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.wrg.york.ac.uk/iread/compiling-and-running-irods-with-micros-services"&gt;http://www.wrg.york.ac.uk/iread/compiling-and-running-irods-with-micros-services&lt;/a&gt;&lt;br /&gt;&lt;a href="http://technical.bestgrid.org/index.php/IRODS_deployment_plan"&gt;http://technical.bestgrid.org/index.php/IRODS_deployment_plan&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-4646684131624655729?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/4646684131624655729/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/12/simple-irods-micro-service.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/4646684131624655729'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/4646684131624655729'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/12/simple-irods-micro-service.html' title='A simple iRODS Micro-Service'/><author><name>Ashu 
Guru</name><uri>http://www.blogger.com/profile/02470446389774568545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-VKhZCQt1S5g/TfLVb8SsSBI/AAAAAAAAAK4/mMzuh0xz_zs/s220/image1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-1846261836896771050</id><published>2011-12-15T11:47:00.001-08:00</published><updated>2011-12-15T12:06:20.242-08:00</updated><title type='text'>How to create openstack controller</title><content type='html'>As before, the "official" instructions on which our procedure is based are here:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://docs.openstack.org/cactus/openstack-compute/admin/content/installing-openstack-compute-on-rhel6.html"&gt;&lt;span style="font-family: courier new;"&gt;http://docs.openstack.org/cactus/openstack-compute/admin/content/installing-openstack-compute-on-rhel6.html&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;First set up the repository:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;wget http://yum.griddynamics.net/yum/cactus/openstack/openstack-repo-2011.2-1.el6.noarch.rpm&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;rpm -ivh openstack-repo-2011.2-1.el6.noarch.rpm&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Then install openstack and its dependencies:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;yum install libvirt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;chkconfig libvirtd on&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;/etc/init.d/libvirtd start&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;yum install euca2ools openstack-nova-{api,compute,network,objectstore,scheduler,volume} openstack-nova-cc-config openstack-glance&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Start 
services:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;service mysqld start&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;chkconfig mysqld on&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;service rabbitmq-server start&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;chkconfig rabbitmq-server on&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Set up database authorisations. First set the root password:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;mysqladmin -uroot password &amp;lt;rootpwd&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Now, to automate the procedure, create an executable shell script&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;openstack-db-setup.sh&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;with the following content (fill in the relevant user name and password fields as well as the IPs):&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;#!/bin/bash&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;DB_NAME=nova&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;DB_USER=&amp;lt;dbuser&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;DB_PASS=&amp;lt;dbpassword&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;ROOT_PASS=&amp;lt;rootpassword&amp;gt; # MySQL root password; not named PWD, which the shell reserves for the current directory&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;#CC_HOST="A.B.C.D" # IPv4 address&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;CC_HOST="130.199.148.53" # IPv4 address, fill your own&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;#HOSTS='node1 node2 node3' # compute nodes list&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;HOSTS='130.199.148.54' # compute nodes list, fill your own&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;mysqladmin -uroot -p$ROOT_PASS -f drop nova&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;mysqladmin -uroot -p$ROOT_PASS create nova&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;for h in $HOSTS localhost; do&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;        echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO '$DB_USER'@'$h' IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p$ROOT_PASS mysql&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;done&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO $DB_USER IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p$ROOT_PASS mysql&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;echo "GRANT ALL PRIVILEGES ON $DB_NAME.* TO root IDENTIFIED BY '$DB_PASS';" | mysql -uroot -p$ROOT_PASS mysql&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;And now execute this script:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;./openstack-db-setup.sh&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Create the db schema:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;nova-manage db sync&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Now comes a point which is not in the "official" instructions. 
The installation will not work unless you patch your python:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;patch -p0 &amp;lt; rhel6-nova-network-patch.diff&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Create logical volumes:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;lvcreate -L 1G --name test nova-volumes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;For your convenience create an openstack startup shell script&lt;span style="font-family: courier new;"&gt; openstack-init.sh&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here is its content:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;#!/bin/bash&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;for n in api compute network objectstore scheduler volume; do  &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;    service openstack-nova-$n $@; &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;done&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;service openstack-glance-api $@&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;And finally we are ready to start openstack:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;openstack-init.sh start&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;With fingers crossed you should get&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova API Server:                        [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova Compute Worker:                    [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova Network Controller:                [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova Object Storage:                    [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova Scheduler:           
              [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova Volume Worker:                     [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Glance API Server:                      [  OK  ]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Now we need to configure and customize the installation which is another story for another day...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-1846261836896771050?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://osgtech.blogspot.com/feeds/1846261836896771050/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/12/how-to-create-openstack-controller.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/1846261836896771050'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/1846261836896771050'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/12/how-to-create-openstack-controller.html' title='How to create openstack controller'/><author><name>TomW</name><uri>http://www.blogger.com/profile/17076069319144757304</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-4953757036561754700</id><published>2011-12-15T11:40:00.001-08:00</published><updated>2011-12-15T11:46:58.207-08:00</updated><title type='text'>How to create openstack worker node</title><content type='html'>The "official" instructions for installing OpenStack components are located here:&lt;br /&gt;&lt;br /&gt;http://docs.openstack.org/cactus/openstack-compute/admin/content/installing-openstack-compute-on-rhel6.html&lt;br /&gt;&lt;br /&gt;Unfortunately, they are not very clear and miss some key points. 
Below is a summary of our installation procedure.&lt;br /&gt;&lt;br /&gt;First of all, let us install the worker node.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;wget http://yum.griddynamics.net/yum/cactus/openstack/openstack-repo-2011.2-1.el6.noarch.rpm&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;rpm -ivh openstack-repo-2011.2-1.el6.noarch.rpm&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;yum install libvirt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;chkconfig libvirtd on&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;/etc/init.d/libvirtd start&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;yum install openstack-nova-compute openstack-nova-compute-config&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;service openstack-nova-compute start&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If everything goes well, you should see&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;Starting OpenStack Nova Compute Worker:                    [  OK  ]&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-4953757036561754700?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/4953757036561754700/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/12/how-to-create-openstack-worker-node.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/4953757036561754700'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/4953757036561754700'/><link rel='alternate' type='text/html' 
href='http://osgtech.blogspot.com/2011/12/how-to-create-openstack-worker-node.html' title='How to create openstack worker node'/><author><name>TomW</name><uri>http://www.blogger.com/profile/17076069319144757304</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-8008964395474516425</id><published>2011-12-08T17:03:00.000-08:00</published><updated>2011-12-08T17:03:39.877-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Condor'/><category scheme='http://www.blogger.com/atom/ns#' term='networking'/><title type='text'>Network Accounting for Condor</title><content type='html'>It's been a long time since the &lt;a href="http://osgtech.blogspot.com/2011/09/per-batch-job-network-statistics.html"&gt;August post&lt;/a&gt; describing how to set up manual network accounting for a process. &amp;nbsp;We now have a solution integrated into Condor and available &lt;a href="https://github.com/bbockelm/condor-network-accounting"&gt;on github&lt;/a&gt;. 
&amp;nbsp;It takes a bit of effort to understand how it works, so I've put together a series of diagrams to illustrate it.&lt;br /&gt;&lt;br /&gt;First, we start off with the lowly &lt;i&gt;condor_starter&lt;/i&gt; on any worker node with a network connection (to simplify things, I didn't draw the other condor processes involved):&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-BtJTZ5Wmg5E/TtqliF1fPGI/AAAAAAAAAdQ/0-6Z7ZqE5YY/s1600/Network+Namespaces+Illustration+1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-BtJTZ5Wmg5E/TtqliF1fPGI/AAAAAAAAAdQ/0-6Z7ZqE5YY/s1600/Network+Namespaces+Illustration+1.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;By default, all processes on the node are in the same network namespace (labelled the "System Network Namespace" in this diagram). &amp;nbsp;We denote the network interface with a box, and assume it has address 192.168.0.1.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Next, the starter will create a pair of virtual ethernet devices. 
&amp;nbsp;We will refer to them as pipe devices, because any byte written into one will come out of the other - just like a venerable Unix pipe:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-GcVmvkaUH2I/TtqmGAbkBoI/AAAAAAAAAdY/mIVqQVfNUHM/s1600/Network+Namespaces+Illustration+2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-GcVmvkaUH2I/TtqmGAbkBoI/AAAAAAAAAdY/mIVqQVfNUHM/s1600/Network+Namespaces+Illustration+2.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;By default, the network pipes are in a down state and have no IP address associated with them. &amp;nbsp;Not very useful! &amp;nbsp;At this point, we have some decisions to make: how should the network pipe device be presented to the network? &amp;nbsp;Should it be networked at layer 3, using NAT to route packets? &amp;nbsp;Or should we bridge it at layer 2, allowing the device to have a public IP address?&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Really, it's up to the site, but we assume most sites will want to take the NAT approach: the public IP address might seem useful, but would require a public IP for each job. &amp;nbsp;To allow customization, all the routing is done by a helper script, but we provide a default implementation for NAT. 
&amp;nbsp;The script:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;Takes two arguments, a unique "job identifier" and the name of the network pipe device.&lt;/li&gt;&lt;li&gt;Is responsible for setting up any routing required for the device.&lt;/li&gt;&lt;li&gt;Must create an iptables chain with the same name as the "job identifier".&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Each rule in the chain will record the number of bytes matched; at the end of the job, these will be reported in the job ClassAd using an attribute name identical to the comment on the rule.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;On stdout, returns the IP address the internal network pipe should use.&lt;/li&gt;&lt;/ul&gt;Additionally, Condor provides a cleanup script that does the inverse of the setup script. &amp;nbsp;The result looks something like this:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-5vX_SFAQkSY/TuFS4jDX3HI/AAAAAAAAAds/4aisRs97iEA/s1600/Network+Namespaces+Illustration+3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-5vX_SFAQkSY/TuFS4jDX3HI/AAAAAAAAAds/4aisRs97iEA/s1600/Network+Namespaces+Illustration+3.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Next, the starter forks a separate process in a new network namespace using the clone() call with the CLONE_NEWNET flag. 
&amp;nbsp;Notice that, by default, no network devices are accessible in the new namespace:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-pTGUfwub6GE/TuFWqWn20GI/AAAAAAAAAd8/aLZy8z9X4N0/s1600/Network+Namespaces+Illustration+4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-pTGUfwub6GE/TuFWqWn20GI/AAAAAAAAAd8/aLZy8z9X4N0/s1600/Network+Namespaces+Illustration+4.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Next, the external starter will pass one side of the pipe to the other namespace; the internal starter will do some minimal configuration of the device (setting the default route and IP address, and bringing the device to the "up" state):&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-rcms2kWSNk4/TuFXwH0_LsI/AAAAAAAAAeE/P8o8wsZf2Bk/s1600/Network+Namespaces+Illustration+5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-rcms2kWSNk4/TuFXwH0_LsI/AAAAAAAAAeE/P8o8wsZf2Bk/s1600/Network+Namespaces+Illustration+5.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Finally, the starter exec's to the job. 
&amp;nbsp;Whenever the job does any network operations, the bytes are routed via the internal network pipe, come out the external network pipe, and then are NAT'd to the physical network device before exiting the machine.&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-2JlhOJkWPWk/TuFbBJ-f4sI/AAAAAAAAAeM/dDy0tHDxEUc/s1600/Network+Namespaces+Illustration+6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/-2JlhOJkWPWk/TuFbBJ-f4sI/AAAAAAAAAeM/dDy0tHDxEUc/s1600/Network+Namespaces+Illustration+6.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;As mentioned, the whole point of the exercise is to do network accounting. &amp;nbsp;Since all packets go through one device, Condor can read out all the activity via iptables. &amp;nbsp;The "helper script" above will create a unique chain per job. 
&amp;nbsp;This allows some level of flexibility; for example, the chain below allows us to distinguish between on-campus and off-campus packets:&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;Chain JOB_12345 (2 references)&lt;br /&gt;&amp;nbsp;pkts bytes target &amp;nbsp; &amp;nbsp; prot opt in &amp;nbsp; &amp;nbsp; out &amp;nbsp; &amp;nbsp; source &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; destination &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;br /&gt;&amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; 0 ACCEPT &amp;nbsp; &amp;nbsp; all &amp;nbsp;-- &amp;nbsp;veth0 &amp;nbsp;em1 &amp;nbsp; &amp;nbsp; anywhere &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 129.93.0.0/16 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;/* OutgoingInternal */&lt;br /&gt;&amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; 0 ACCEPT &amp;nbsp; &amp;nbsp; all &amp;nbsp;-- &amp;nbsp;veth0 &amp;nbsp;em1 &amp;nbsp; &amp;nbsp; anywhere &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;!129.93.0.0/16 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;/* OutgoingExternal */&lt;br /&gt;&amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; 0 ACCEPT &amp;nbsp; &amp;nbsp; all &amp;nbsp;-- &amp;nbsp;em1 &amp;nbsp; &amp;nbsp;veth0 &amp;nbsp; 129.93.0.0/16 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;anywhere &amp;nbsp; &amp;nbsp; &amp;nbsp;   &amp;nbsp; &amp;nbsp; state RELATED,ESTABLISHED /* IncomingInternal */&lt;br /&gt;&amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; 0 ACCEPT &amp;nbsp; &amp;nbsp; all &amp;nbsp;-- &amp;nbsp;em1 &amp;nbsp; &amp;nbsp;veth0 &amp;nbsp;!129.93.0.0/16 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;anywhere &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; state RELATED,ESTABLISHED /* IncomingExternal */&lt;br /&gt;&amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; 0 REJECT &amp;nbsp; &amp;nbsp; all &amp;nbsp;-- &amp;nbsp;any &amp;nbsp; &amp;nbsp;any 
&amp;nbsp; &amp;nbsp; anywhere &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; anywhere &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; reject-with icmp-port-unreachable&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Thus, the resulting ClassAd history from this job will have an attribute for &lt;i&gt;NetworkOutgoingInternal&lt;/i&gt;, &lt;i&gt;NetworkOutgoingExternal&lt;/i&gt;, &lt;i&gt;NetworkIncomingInternal&lt;/i&gt;, and &lt;i&gt;NetworkIncomingExternal&lt;/i&gt;. &amp;nbsp;We have an updated Condor Gratia probe that looks for &lt;i&gt;Network*&lt;/i&gt; attributes and reports them appropriately to the accounting database.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;Thus, we have byte-level network accounting, allowing us to answer the age-old question of "how much would a CMS T2 cost on Amazon EC2?". &amp;nbsp;Or perhaps we could answer "how much is a currently running job going to cost me?" Matt has pointed out the network setup callout could be used to implement security zones, isolating (or QoS'ing) jobs of certain users at the network level. &amp;nbsp;There are quite a few possibilities! &amp;nbsp;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;We'll definitely be returning to this work mid-2012 when the local T2 is based on SL6, and this patch can be put into production. &amp;nbsp;There will be some further engagement with the Condor team to see if they're interested in taking the patch. &amp;nbsp;The Gratia probe work to manage network information will be interesting upstream too. &amp;nbsp;Finally, I encourage interested readers to take a look at the github branch. 
&amp;nbsp;The patch itself is a tour-de-force of several dark corners of Linux systems programming (involves using clone, synchronization between processes with pipes, sending messages to the kernel via netlink to configure the routing, and reading out iptables configurations using C). &amp;nbsp;It was very rewarding to implement!&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-8008964395474516425?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/8008964395474516425/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/12/network-accounting-for-condor.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/8008964395474516425'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/8008964395474516425'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/12/network-accounting-for-condor.html' title='Network Accounting for Condor'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://4.bp.blogspot.com/-BtJTZ5Wmg5E/TtqliF1fPGI/AAAAAAAAAdQ/0-6Z7ZqE5YY/s72-c/Network+Namespaces+Illustration+1.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-2079248913615855270</id><published>2011-12-01T17:46:00.000-08:00</published><updated>2011-12-01T17:46:58.275-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='glexec'/><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='Condor'/><category scheme='http://www.blogger.com/atom/ns#' term='gums'/><title type='text'>Details on glexec improvements</title><content type='html'>&lt;a href="http://osgtech.blogspot.com/2011/11/improving-glexec-enabled-life.html"&gt;My last blog post&lt;/a&gt; gave a quick overview of why &lt;i&gt;glexec&lt;/i&gt; exists, what issues folks run into, and what we did to improve it. &amp;nbsp;Let's go into some details.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;How Condor Update Works&lt;/span&gt;&lt;br /&gt;The &lt;b&gt;lcmaps-plugins-condor-update&lt;/b&gt; package contains the modules necessary to advertise the payload certificate of the last glexec invocation in the pilot's ClassAd. &amp;nbsp;The concept is simple - the implementation is a bit tricky.&lt;br /&gt;&lt;br /&gt;For a long time, Condor has had a command-line tool called &lt;i&gt;condor_advertise&lt;/i&gt;; it allows an admin to hand-advertise updates to ads in the collector. &amp;nbsp;Unfortunately, that's not quite what we need here: we want to update the &lt;b&gt;job&lt;/b&gt;&amp;nbsp;ad in the &lt;b&gt;schedd&lt;/b&gt;, while condor_advertise typically updates the &lt;b&gt;machine&lt;/b&gt;&amp;nbsp;ad in the &lt;b&gt;collector&lt;/b&gt;. &amp;nbsp;Close, but no cigar.&lt;br /&gt;&lt;br /&gt;There's a lesser-known utility called &lt;i&gt;condor_chirp&lt;/i&gt; that we can use. 
&amp;nbsp;Typically, &lt;i&gt;condor_chirp&lt;/i&gt; is used to do I/O between the schedd and the starter (for example, you can pull/push files on demand in the middle of the job), but it can also update the job's ad in the schedd. &amp;nbsp;The syntax is simple:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;condor_chirp set_job_attr ATTR_NAME ATTR_VAL&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(&lt;a href="http://spinningmatt.wordpress.com/2011/02/27/service-as-a-job-the-tomcat-app-server/"&gt;look at the clever things Matt does with condor_chirp&lt;/a&gt;). &amp;nbsp;As condor_chirp allows additional access to the schedd, the user must explicitly request it in the job ad. &amp;nbsp;If you want to try it out, you must add the following line into your submit file:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;+WantIOProxy=TRUE&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;To work, chirp must know how to contact the starter and have access to the "magic cookie"; these are located inside the &lt;b&gt;$_CONDOR_SCRATCH_DIR&lt;/b&gt;, as set by Condor in the initial batch process. &amp;nbsp;As the glexec plugin runs as root (glexec must be setuid root to launch a process as a different UID), we must guard against being fooled by the invoking user.&lt;br /&gt;Accordingly, the plugin uses &lt;b&gt;/proc&lt;/b&gt; to walk up the process tree, parent by parent, until it finds a process owned by root. &amp;nbsp;If this is not init, it is assumed the process is the condor_starter, and the job's &lt;b&gt;$_CONDOR_SCRATCH_DIR&lt;/b&gt; can be deduced from the &lt;b&gt;$CWD &lt;/b&gt;and the PID of the starter. &amp;nbsp;Since we only rely on information from root-owned processes, we can be fairly sure this is the correct scratch directory. &amp;nbsp;As a further safeguard, before invoking &lt;i&gt;condor_chirp&lt;/i&gt;, the plugin drops privilege to that of the invoking user. 
&amp;nbsp;Along with the other security guarantees provided by &lt;i&gt;glexec&lt;/i&gt;, we have confidence that we are reading the correct chirp configuration and are not allowing the invoker to increase its privileges.&lt;br /&gt;&lt;br /&gt;Once we know how to invoke &lt;i&gt;condor_chirp&lt;/i&gt;, the rest of the process is all downhill. &amp;nbsp;&lt;i&gt;glexec&lt;/i&gt; internally knows the payload's DN, the payload Unix user, and does the equivalent of the following:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;condor_chirp set_job_attr glexec_user "hcc"&lt;br /&gt;condor_chirp set_job_attr glexec_x509userproxysubject "/DC=org/DC=cilogon/C=US/O=University of Nebraska-Lincoln/CN=Brian Bockelman A621"&lt;br /&gt;condor_chirp set_job_attr glexec_time 1322761868&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;condor_chirp writes the data into the starter, which then updates the shadow, then the schedd (&lt;a href="https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=JobClassAdFlow"&gt;some of the gory details are covered in the Condor wiki&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;The diagram below illustrates the data flow:&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-EV7fEqSye6s/TtgiO2sLfII/AAAAAAAAAcw/L4Ilyk8k1JM/s1600/glexec-condor-update.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-EV7fEqSye6s/TtgiO2sLfII/AAAAAAAAAcw/L4Ilyk8k1JM/s1600/glexec-condor-update.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Putting this into Play&lt;/span&gt;&lt;br /&gt;If you really want to get messy, you can check out the source code from Subversion at:&lt;br /&gt;&lt;pre&gt;svn://t2.unl.edu/brian/lcmaps-plugins-condor-update&lt;/pre&gt;(&lt;a href="http://t2.unl.edu:8094/browser/lcmaps-plugins-condor-update"&gt;web view&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;The current version of the plugin is 
0.0.2. &amp;nbsp;It's &lt;a href="https://koji-hub.batlab.org/koji/buildinfo?buildID=616"&gt;available in Koji&lt;/a&gt;, or via yum in the osg-development repository:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;yum install --enablerepo=osg-development lcmaps-plugins-condor-update&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(you must already have the &lt;i&gt;osg-release&lt;/i&gt; RPM installed and &lt;i&gt;glexec&lt;/i&gt; otherwise configured).&lt;br /&gt;&lt;br /&gt;After installing it, you need to update the &lt;b&gt;/etc/lcmaps.db&lt;/b&gt;&amp;nbsp;configuration file on the worker node to invoke the condor-update module. &amp;nbsp;In the top half, I add:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;condor_updates = "lcmaps_condor_update.mod"&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Then, I add &lt;i&gt;condor-update&lt;/i&gt; to the &lt;i&gt;glexec&lt;/i&gt; policy:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;glexec:&lt;br /&gt;&lt;br /&gt;verifyproxy -&amp;gt; gumsclient&lt;br /&gt;gumsclient -&amp;gt; condor_updates&lt;br /&gt;condor_updates -&amp;gt; tracking&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;Note we use the "tracking" module locally; most sites will use the "glexec-tracking" module. &amp;nbsp;Pick the appropriate one.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;Finally, you need to turn on the I/O proxy in the Condor submit file. &amp;nbsp;We do this by editing &lt;b&gt;condor.pm&lt;/b&gt;&amp;nbsp; (for RPMs, located in &lt;b&gt;/usr/lib/perl5/vendor_perl/5.8.8/Globus/GRAM/JobManager/condor.pm&lt;/b&gt;). 
&amp;nbsp;We add the following line into the &lt;i&gt;submit&lt;/i&gt; routine, right before &lt;b&gt;queue&lt;/b&gt; is added to the script file:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;print SCRIPT_FILE "+WantIOProxy=TRUE\n";&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;All new incoming jobs will get this attribute; any &lt;i&gt;glexec&lt;/i&gt; invocations they do will be reflected at the CE!&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;GUMS and Worker Node Certificates&lt;/span&gt;&lt;br /&gt;&lt;div&gt;To map a certificate to a Unix user, &lt;i&gt;glexec&lt;/i&gt; calls out to the GUMS server using XACML with a grid-interoperable profile. &amp;nbsp;In the XACML callout, GUMS is given the payload's DN and VOMS attributes. &amp;nbsp;The same library (LCMAPS/SCAS-client) and protocol can also make callouts directly to SCAS, more commonly used in Europe.&lt;br /&gt;&lt;br /&gt;GUMS is a powerful and flexible authorization tool; one feature is that it allows different mappings based on the originating hostname. &amp;nbsp;For example, if desired, my certificate could map to user &lt;b&gt;hcc&lt;/b&gt;&amp;nbsp;at &lt;i&gt;red.unl.edu&lt;/i&gt; but map to &lt;b&gt;cmsprod&lt;/b&gt; at &lt;i&gt;ff-grid.unl.edu&lt;/i&gt;. 
&amp;nbsp;To prevent "just anyone" from probing the GUMS server, GUMS requires the client to present an X509 certificate (in this case, the hostcert); it takes the hostname from the client's certificate.&lt;br /&gt;&lt;br /&gt;This has the unfortunate side-effect of requiring a host certificate on every node that invokes GUMS; OK for the CE (100 in the OSG), but not for glexec on the worker nodes (thousands on the OSG).&lt;br /&gt;&lt;br /&gt;When &lt;i&gt;glexec&lt;/i&gt; is invoked in EGI, SCAS is invoked using the pilot certificate for HTTPS and information about the payload certificate in the XACML callout; this requires no worker node host certificate.&lt;br /&gt;&lt;br /&gt;To replicate how &lt;i&gt;glexec&lt;/i&gt; works in EGI, we had to develop a small patch to GUMS. &amp;nbsp;When the pilot certificate is used for authentication, the pilot's DN is recorded to the logs (so we know who is invoking GUMS), but the host name is self-reported in the XACML callout. &amp;nbsp;As the authentication is still performed, we believe this relaxing of the security model is acceptable.&lt;br /&gt;&lt;br /&gt;A &lt;a href="https://koji-hub.batlab.org/koji/buildinfo?buildID=944"&gt;patched, working version of GUMS&lt;/a&gt; can be found in Koji and is available in the osg-development repository. 
&amp;nbsp;It will still be a few months before the RPM-based GUMS install is fully documented and released, however.&lt;br /&gt;&lt;br /&gt;Once installed, two changes need to be made at the server:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Do all hostname mappings based on "DN" in the web interface, not the "CN".&lt;/li&gt;&lt;li&gt;Any group of users (for example, /cms/Role=pilot) that want to invoke GUMS must have "read all" access, not just "read self".&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Further, &lt;b&gt;/etc/lcmaps.db&lt;/b&gt; needs to be changed to &lt;u&gt;remove&lt;/u&gt;&amp;nbsp;the following lines from the gumsclient module:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;"-cert &amp;nbsp; /etc/grid-security/hostcert.pem"&lt;br /&gt;"-key &amp;nbsp; &amp;nbsp;/etc/grid-security/hostkey.pem"&lt;br /&gt;"--cert-owner root"&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;This will all be automated going forward, but it should already help remove some of the pain in deploying &lt;i&gt;glexec&lt;/i&gt;!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-2079248913615855270?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/2079248913615855270/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/12/details-on-glexec-improvements.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/2079248913615855270'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/2079248913615855270'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/12/details-on-glexec-improvements.html' title='Details on glexec 
improvements'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-EV7fEqSye6s/TtgiO2sLfII/AAAAAAAAAcw/L4Ilyk8k1JM/s72-c/glexec-condor-update.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-5769556833312037616</id><published>2011-11-11T08:32:00.000-08:00</published><updated>2011-11-11T08:32:28.848-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='glexec'/><title type='text'>Improving the glexec-enabled life</title><content type='html'>&lt;div class="p1"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p1"&gt;Pilot-based workflow management systems have dramatically transformed how we view the grid today.&amp;nbsp; Instead of queueing a job (the "payload") in a workflow onto a site on a grid, these systems send an "empty" job that starts up, then downloads and starts the payload from a central endpoint.&amp;nbsp; In CS terms, it switches from a model of "work delegation" to "resource allocation".&amp;nbsp; By allocating the resource (i.e., starting the pilot job) prior to delegating work, users no longer have to know the vagaries/failure modes of direct grid submission and don't have to pay the price of sending their payloads to a busy site!&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p1"&gt;In short, pilot jobs make the grid much better.&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p1"&gt;However, like most concepts, pilot jobs are a trade-off: they make life easier for users, but harder for security folks and 
sysadmins.&amp;nbsp; Pilots are sent using one certificate, but payloads are run under a different identity. &amp;nbsp;If the payload job wants to act on behalf of the user, it needs to bring the user's grid credentials to the worker node. &amp;nbsp;[Side note: this is actually an interesting assumption. &amp;nbsp;The &lt;a href="http://iopscience.iop.org/1742-6596/119/6/062036"&gt;PanDA pilot system&lt;/a&gt;, heavily utilized by ATLAS, does not bring credentials to the worker node. &amp;nbsp;This simplifies this problem, but opens up a different set of concerns.] &amp;nbsp;If both pilot and payload are run as the same Unix user, the payload user can easily access the credentials (including the pilot credentials), executables, and output data of other running payloads.&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p1"&gt;The program &lt;a href="https://www.nikhef.nl/pub/projects/grid/gridwiki/index.php/GLExec"&gt;glexec&lt;/a&gt; is a "simple" idea to solve this problem: given a set of grid credentials, launch a process under the corresponding Unix account at the site. &amp;nbsp;For example, with credentials from the HCC VO:&lt;/div&gt;&lt;div class="p1"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p1"&gt;&lt;pre&gt;[bbockelm@brian-test ~]$ whoami&lt;br /&gt;bbockelm&lt;br /&gt;[bbockelm@brian-test ~]$ GLEXEC_CLIENT_CERT=/tmp/x509up_u1221 /usr/sbin/glexec &lt;br /&gt;/usr/bin/whoami&lt;br /&gt;hcc&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="p2"&gt;(You'll notice the invocation is not as simple as typing "glexec whoami"; it's not exactly designed for end-user invocation). &amp;nbsp;To achieve the user switching, glexec has to be &lt;a href="http://en.wikipedia.org/wiki/Setuid"&gt;setuid&lt;/a&gt; root. 
&amp;nbsp;Setuid binaries must be examined under a security microscope, which has unfortunately led to slow adoption of glexec.&lt;/div&gt;&lt;div class="p2"&gt;&lt;br /&gt;The idea is that pilot jobs would wrap the payload with a call to "glexec", separating the payload from the pilot and other payloads. &amp;nbsp;From there, it goes horribly wrong.&amp;nbsp; Not wrong really - but rather things get sticky.&lt;br /&gt;&lt;br /&gt;Since the pilot and payload are both low-privileged users, the pilot doesn't have permission to clean up or kill the payload. &amp;nbsp;It must again use glexec to send signals and delete sandboxes. &amp;nbsp;The several invocations are easy to screw up (and place load on the authorization system!). &amp;nbsp;There are tricky error conditions - if authorization breaks in the middle of the job, how does the pilot clean up the payload?&lt;br /&gt;&lt;br /&gt;As the payload is a full-fledged Linux process, it can create other processes, daemonize, escape from the batch system, etc. &amp;nbsp;As &lt;a href="http://osgtech.blogspot.com/2011/06/how-your-batch-system-watches-your.html"&gt;previously&lt;/a&gt; &lt;a href="http://osgtech.blogspot.com/2011/06/part-ii-keeping-mindful-eye-on-your.html"&gt;discussed&lt;/a&gt;, the batch system - with root access - typically does a poor job tracking processes. &amp;nbsp;The pilot will be hopeless unless we provide some assistance.&lt;br /&gt;&lt;br /&gt;Glexec imposes an integration difficulty at some sites. &amp;nbsp;There are popular cron scripts that kill processes belonging to users on a node that aren't currently running batch system jobs. &amp;nbsp;So, if the pilot maps to "cms" and the payload maps to "cmsuser", the batch system only knows about "cms", and the cronjob will kill all processes belonging to "cmsuser". &amp;nbsp;We lost quite a few jobs at some sites before we figured this out!&lt;br /&gt;&lt;br /&gt;Site admins manage the cluster via the batch system. 
&amp;nbsp;Since the payload is invisible to the batch system, we're unable to kill jobs from a user with batch system tools (condor_rm, qdel). &amp;nbsp;In fact, if we get an email from a user asking for help understanding their jobs, we can't even easily find where the job is running! &amp;nbsp;Site admins have to ssh into each worker node and examine the running jobs, a process that is simply medieval.&lt;br /&gt;&lt;br /&gt;Finally, on the OSG, invoking the authorization system requires host certificate credentials. &amp;nbsp;This is not a problem when host certs are needed for a handful of CEs at the site, but explodes when glexec is run on each worker node. &amp;nbsp;This is a piece of unique state on the worker nodes for sites to manage, adding to the glexec headache.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;We're the Government. &amp;nbsp;We're here to help.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The OSG Technology group has decided to tackle the three biggest site-admin usability issues in glexec:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;b&gt;Batch system integration&lt;/b&gt;: The Condor batch system provides the ability for running jobs to update the submit node with arbitrary status. &amp;nbsp;We have developed a plugin that updates the job's ClassAd with the payload's DN whenever glexec is invoked.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Process tracking&lt;/b&gt;: There is an existing glexec plugin to do process tracking. &amp;nbsp;However, this requires an admin to set up secondary GID ranges (an administration headache) and suffers from the previously-documented process tracking issues. 
&amp;nbsp;We will port the ProcPolice daemon over to the glexec plugin framework.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Worker node certificates&lt;/b&gt;: We propose to fix this via improvements to GUMS, allowing the mappings to be performed based on the presence of the "Role=pilot" VOMS extension in the pilot certificate.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;The plugins in (1) and (2) have been prototyped, and are available in the osg-development repository as "lcmaps-plugins-condor-update" and "lcmaps-plugins-process-tracking", respectively. &amp;nbsp;The third item is currently cooking.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The "lcmaps-plugins-condor-update" plugin is especially useful, as it's a brand-new capability as opposed to an improvement. &amp;nbsp;It advertises three attributes in the job's ClassAd:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;glexec_x509userproxysubject&lt;/b&gt;: The DN of the payload user.&lt;/li&gt;&lt;li&gt;&lt;b&gt;glexec_user&lt;/b&gt;: The Unix username for the payload.&lt;/li&gt;&lt;li&gt;&lt;b&gt;glexec_time&lt;/b&gt;: The Unix time when glexec was invoked.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;We can then use these attributes to filter and locate jobs. 
&amp;nbsp;For example, if a user named Ian complains his jobs are running slowly, we could locate a few with the following command:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;pre&gt;[bbockelm@t3-sl5 ~]$ condor_q -g -const 'regexp("Ian", glexec_x509userproxysubject)' -format '%s ' ClusterId -format '%s\n' RemoteHost | head&lt;br /&gt;868341 slot6@red-d11n10.red.hcc.unl.edu&lt;br /&gt;868343 slot7@node238.red.hcc.unl.edu&lt;br /&gt;868358 slot6@red-d11n9.red.hcc.unl.edu&lt;br /&gt;868366 slot2@node239.red.hcc.unl.edu&lt;br /&gt;868373 slot3@node119.red.hcc.unl.edu&lt;br /&gt;868741 slot8@red-d9n6.red.hcc.unl.edu&lt;br /&gt;868770 slot3@red-d9n8.red.hcc.unl.edu&lt;br /&gt;868819 slot5@node109.red.hcc.unl.edu&lt;br /&gt;868820 slot4@node246.red.hcc.unl.edu&lt;br /&gt;868849 slot2@red-d11n6.red.hcc.unl.edu&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Slick!&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-5769556833312037616?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/5769556833312037616/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/11/improving-glexec-enabled-life.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/5769556833312037616'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/5769556833312037616'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/11/improving-glexec-enabled-life.html' title='Improving the glexec-enabled life'/><author><name>Brian 
Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-370365437990996479</id><published>2011-10-19T07:14:00.000-07:00</published><updated>2011-10-19T11:02:01.609-07:00</updated><title type='text'>KVM and Condor (Part 2): Condor configuration for VM Universe &amp; VM Image Staging</title><content type='html'>&lt;div style="text-align: justify;"&gt;This is Part 2, following my previous blog&amp;nbsp; &lt;a href="http://osgtech.blogspot.com/2011/08/kernel-based-virtualization-and-condor.html" target="_blank"&gt;KVM and Condor (Part 1): Creating the virtual machine&lt;/a&gt;.&amp;nbsp; In this blog I will share the steps for configuring the &lt;a href="http://www.cs.wisc.edu/condor/" target="_blank"&gt;Condor&lt;/a&gt; VM Universe; in addition, I will discuss the steps involved in staging the VM disk images. 
It is assumed that you have a basic setup of Condor working and that there is a shared file system accessible from each of the worker nodes.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;As a first step, please make sure that the worker nodes support KVM-based virtualization. If the KVM packages are not yet installed, you may use:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;yum groupinstall "KVM"&lt;/div&gt;&lt;div style="text-align: justify;"&gt;and yum -y install kvm libvirt libvirt-python python-virtinst libvirt-client&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-size: large;"&gt;Configuring Condor for KVM&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;For Condor to support the VM universe, the following attributes must be set in the Condor configuration of each of the worker nodes (this may be done by modifying the local Condor config file):&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;VM_GAHP_SERVER = $(SBIN)/condor_vm-gahp&lt;br /&gt;VM_GAHP_LOG = $(LOG)/VMGahpLog&lt;br /&gt;VM_MEMORY = 5000&lt;br /&gt;VM_TYPE = kvm&lt;br /&gt;VM_NETWORKING = true&lt;br /&gt;VM_NETWORKING_TYPE = nat&lt;br /&gt;ENABLE_URL_TRANSFERS = TRUE&lt;br /&gt;FILETRANSFER_PLUGINS = /usr/local/bin/vm-nfs-plugin&lt;br /&gt;&lt;/pre&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The explanation of the above attributes follows:&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;table border="1" cellpadding="2" cellspacing="0" style="width: 550px;"&gt;&lt;tbody&gt;&lt;tr&gt; 
&lt;td valign="top"&gt;&lt;b&gt;Attribute&lt;/b&gt;&lt;/td&gt; &lt;td valign="top"&gt;&lt;b&gt;Description&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td valign="top"&gt;VM_GAHP_SERVER &lt;/td&gt; &lt;td valign="top"&gt;The complete path and file name of the condor_vm-gahp.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td valign="top"&gt;VM_GAHP_LOG&lt;/td&gt; &lt;td valign="top"&gt;The complete path and file name of the condor_vm-gahp log.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td valign="top"&gt;VM_MEMORY&lt;/td&gt; &lt;td valign="top"&gt;A VM universe job is required to specify the memory needs of its virtual machine with vm_memory (in Mbytes) in its job description file. On the worker node, the value of the VM_MEMORY configuration is used for matching the memory requested by the job. VM_MEMORY is an integer value that specifies the maximum amount of memory in Mbytes that will be allowed for the virtual machine program.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td valign="top"&gt;VM_TYPE &lt;/td&gt; &lt;td valign="top"&gt;This attribute can take the values kvm, xen or vmware, and specifies the type of supported virtual machine software. &lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td valign="top"&gt;VM_NETWORKING&lt;/td&gt; &lt;td valign="top"&gt;Must be set to true to support networking in the VM instances.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td valign="top"&gt;VM_NETWORKING_TYPE &lt;/td&gt; &lt;td valign="top"&gt;This is a string value describing the type of networking (nat in the configuration above).&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td valign="top"&gt;ENABLE_URL_TRANSFERS&lt;/td&gt; &lt;td valign="top"&gt;This is a Boolean value that, when True, causes the condor_starter for a job to invoke all plug-ins defined by FILETRANSFER_PLUGINS when a file transfer is specified with a URL in the job description file. 
&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt; &lt;td valign="top"&gt;FILETRANSFER_PLUGINS&lt;/td&gt; &lt;td valign="top"&gt;A comma-separated list of absolute paths of the executable(s) for plug-ins that will accomplish the task of file transfer when a job requests the transfer of an input file by specifying a URL. &lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-size: large;"&gt;The File Transfer Plugin&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;So far we have modified the configuration of the Condor worker nodes to support the Condor VM universe. Next I will describe a barebones FILETRANSFER_PLUGINS&amp;nbsp; executable.&amp;nbsp; I will use bash for scripting, and the plugin will reside at /usr/local/bin/vm-nfs-plugin on each of the worker nodes.&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;#!/bin/bash&lt;br /&gt;#file: /usr/local/bin/vm-nfs-plugin&lt;br /&gt;#----------------------------------------&lt;br /&gt;# Plugin Essential&lt;br /&gt;if [ "$1" = "-classad" ]&lt;br /&gt;then&lt;br /&gt;&amp;nbsp;&amp;nbsp; echo "PluginVersion = \"0.1\""&lt;br /&gt;&amp;nbsp;&amp;nbsp; echo "PluginType = \"FileTransfer\""&lt;br /&gt;&amp;nbsp;&amp;nbsp; echo "SupportedMethods = \"nfs\""&lt;br /&gt;&amp;nbsp;&amp;nbsp; exit 0&lt;br /&gt;fi&lt;br /&gt;&lt;br /&gt;#----------------------------------------&lt;br /&gt;# Variable definitions&lt;br /&gt;# transferInputstr_format='nfs:&amp;lt;abs path to (nfs hosted) inputfile file&amp;gt;:&amp;lt;basename of vminstance file&amp;gt;'&lt;br /&gt;WHICHQEMUIMG='/usr/bin/qemu-img'&lt;br /&gt;initdir=$PWD&lt;br /&gt;transferInputstr=$1&lt;br /&gt;#-------------------------------------------&lt;br /&gt;# Split the first argument into an array&lt;br /&gt;IFS=':' read -ra transferInputarray 
&amp;lt;&amp;lt;&amp;lt; "$transferInputstr"&lt;br /&gt;#-------------------------------------------&lt;br /&gt;# create the VM instance as a copy-on-write image backed by the template&lt;br /&gt;"$WHICHQEMUIMG" create -b "${transferInputarray[1]}" -f qcow2 "${initdir}/${transferInputarray[2]}"&lt;br /&gt;# pass the exit status of qemu-img back to Condor&lt;br /&gt;exit $?&lt;br /&gt;&lt;/pre&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/div&gt;&lt;div style="text-align: justify;"&gt;Overall, the idea behind the above script is to create a qcow2-formatted VM instance file in the execute directory that Condor allocates for the job.&amp;nbsp; The code blocks above are explained below: &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;The “# Plugin Essential”&amp;nbsp; part of the code is required of every Condor file transfer plug-in, so that the plug-in can be registered to handle file transfers based on the methods (protocols) it supports. The condor_starter daemon invokes each plug-in with the command line argument ‘-classad’ to identify the protocols that the plug-in supports, and expects the plug-in to respond with three ClassAd attributes. The first two are&amp;nbsp; fixed: PluginVersion = "0.1" and PluginType = "FileTransfer"; the third is the ClassAd attribute ‘SupportedMethods’, a string value containing a comma-separated list of the protocols that the plug-in handles. Thus, in the script above, SupportedMethods = "nfs" declares that the plug-in vm-nfs-plugin supports a user-defined protocol ‘nfs’. Accordingly, the ‘nfs’ string will be matched against the protocol specified in a URL in the transfer_input_files command of a Condor job description file. 
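&lt;br /&gt;&lt;br /&gt;The handshake described above can be sketched in a few lines of bash. This is only an illustration of the idea, not Condor's actual matching code: the function names print_plugin_classad, url_method and method_is_supported are hypothetical, and the sketch assumes the SupportedMethods list contains no spaces.
&lt;pre&gt;
```shell
#!/bin/bash
# Sketch only: these helper names are hypothetical, not part of Condor.

# Print the three ClassAd attributes a plug-in reports for '-classad'.
print_plugin_classad() {
    echo 'PluginVersion = "0.1"'
    echo 'PluginType = "FileTransfer"'
    echo 'SupportedMethods = "nfs"'
}

# Return the method (protocol) part of a URL: everything before the first ':'.
url_method() {
    printf '%s\n' "${1%%:*}"
}

# Succeed if the method named in the URL ($1) appears in the
# comma-separated SupportedMethods list ($2); assumes no spaces in the list.
method_is_supported() {
    case ",$2," in
        *",$(url_method "$1"),"*) return 0 ;;
        *) return 1 ;;
    esac
}
```
&lt;/pre&gt;
With these helpers, a URL such as nfs:/srv/images/base.img:vmimage.img (a made-up path) would be routed to a plug-in advertising "nfs", while a URL with any other method would not.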
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;For a file transfer, a plug-in is invoked with two arguments: the first is the URL specified in the job description file, and the second is the absolute path identifying where to place the transferred file.&amp;nbsp; The plug-in is expected to transfer the file and exit with a status of 0 when the transfer is successful. A non-zero status must be returned when the transfer is unsuccessful. After an unsuccessful transfer the job is placed on hold, the job ClassAd attribute HoldReason is set with a message, and HoldReasonSubCode is set to the exit status of the plug-in.&amp;nbsp; &lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;In the bash code above I am only using the first argument that is received by the plugin. Further, I have decided that the value of transfer_input_files will follow the format given in the script's commented-out variable transferInputstr_format, i.e. 'nfs:&amp;lt;abs path to (nfs hosted) inputfile file&amp;gt;:&amp;lt;basename of vminstance file&amp;gt;'. Thus, after splitting the first argument it receives, the plug-in creates a qcow2 image that uses the original template as its backing file. 
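&lt;br /&gt;&lt;br /&gt;As a minimal sketch of the same splitting step with some validation added, the functions below check that all three colon-separated fields are present and hand the status of qemu-img back to the caller. The names split_transfer_url, stage_image, backing_file, instance_name and the QEMU_IMG override are hypothetical and are not part of the plug-in above.
&lt;pre&gt;
```shell
#!/bin/bash
# Sketch only: function and variable names are hypothetical.

# Split 'nfs:BACKING_PATH:INSTANCE_NAME' into two shell variables.
# Returns non-zero if the URL does not have all three fields.
split_transfer_url() {
    local url rest
    url=$1
    case $url in
        nfs:*:*) ;;
        *) return 1 ;;
    esac
    rest=${url#nfs:}
    backing_file=${rest%%:*}     # absolute path of the NFS-hosted template
    instance_name=${rest#*:}     # basename of the copy-on-write instance
    [ -n "$backing_file" ] || return 1
    [ -n "$instance_name" ] || return 1
}

# Create the copy-on-write image in the current directory and return the
# status of qemu-img, so that a failed transfer yields a non-zero status.
# QEMU_IMG is a hypothetical override for the path to qemu-img.
stage_image() {
    split_transfer_url "$1" || return 1
    "${QEMU_IMG:-/usr/bin/qemu-img}" create -b "$backing_file" -f qcow2 "$PWD/$instance_name"
}
```
&lt;/pre&gt;
A plug-in built this way could end with a call to stage_image "$1" followed by exit $?, so that a malformed URL or a qemu-img failure puts the job on hold instead of being reported as a successful transfer.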
&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;Now once we send a condor reconfig&amp;nbsp; using condor_reconfig to the worker node or restart condor service (service condor restart) on the worker nodes the plug-in is ready to be used; an example submit file is shown below.&amp;nbsp;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-size: large;"&gt;Example Job Description &lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;#Condor job description file&lt;br /&gt;universe=vm&lt;br /&gt;vm_type=kvm&lt;br /&gt;executable=agurutest_vm&lt;br /&gt;vm_networking=true&lt;br /&gt;vm_no_output_vm=true&lt;br /&gt;vm_memory=1536&lt;br /&gt;#Point to the nfs location that will be available from worker node&lt;br /&gt;transfer_input_files=nfs://&amp;lt;path to the vm image&amp;gt;:vmimage.img&lt;br /&gt;vm_disk="vmimage.img:hda:rw"&lt;br /&gt;requirements= (TARGET.FileSystemDomain =!= FALSE) &amp;amp;&amp;amp; ( TARGET.VM_Type == "kvm" ) &amp;amp;&amp;amp; ( TARGET.VM_AvailNum &amp;gt; 0 ) &amp;amp;&amp;amp; ( VM_Memory &amp;gt;= 0 ) &lt;br /&gt;log=test.log&lt;br /&gt;queue 1&lt;br /&gt;&lt;/pre&gt;&lt;div style="text-align: justify;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: justify;"&gt;This submit file should invoke the vm-nfs-plugin and a VM instance should start on a worker node. 
You can test the VM by opening a shell on the worker node and using the virsh utility.&lt;br /&gt;&lt;br /&gt;That is all for this blog. In Part 3, which is the last part of this series, I will write about using a file transfer plugin with the Storage Resource Manager (SRM).&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-370365437990996479?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/370365437990996479/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/10/kvm-and-condor-part-2-condor.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/370365437990996479'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/370365437990996479'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/10/kvm-and-condor-part-2-condor.html' title='KVM and Condor (Part 2): Condor configuration for VM Universe &amp; VM Image Staging'/><author><name>Ashu Guru</name><uri>http://www.blogger.com/profile/02470446389774568545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-VKhZCQt1S5g/TfLVb8SsSBI/AAAAAAAAAK4/mMzuh0xz_zs/s220/image1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-5335055704434760709</id><published>2011-09-08T09:00:00.000-07:00</published><updated>2011-09-08T09:00:32.527-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='Condor'/><category scheme='http://www.blogger.com/atom/ns#' 
term='cgroups'/><category scheme='http://www.blogger.com/atom/ns#' term='networking'/><category scheme='http://www.blogger.com/atom/ns#' term='accounting'/><title type='text'>Per-Batch Job Network Statistics</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: large;"&gt;Introduction&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The OSG takes a fairly abstract definition of a cloud:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;A cloud is a service that provisions resources on-demand for a marginal cost&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The two important pieces of this definition are "resource provisioning" and "marginal cost". &amp;nbsp;The most common cloud instance you'll run into is Amazon EC2, which provisions VMs; depending on the size of VM, the marginal cost is between $0.03 and $0.80 an hour.&lt;br /&gt;&lt;br /&gt;The EC2 charge model is actually more complicated than just VMs-per-hour. &amp;nbsp;There are additional charges for storage and network use. &amp;nbsp;In controlled experiments last year, CMS determined the largest cost of using EC2 was not the CPU time, but the network usage.&lt;br /&gt;&lt;br /&gt;This showed a glaring hole in OSG's current accounting: we only record wall and CPU time. &amp;nbsp;For the other metrics - which can't be estimated accurately by looking at wall time - we are blind.&lt;br /&gt;&lt;br /&gt;Long story short - if OSG ever wants to provide a cloud service using our batch systems, we need better accounting.&lt;br /&gt;&lt;br /&gt;Hence, we are running a technology investigation to bring batch system accounting up to par with EC2's:&amp;nbsp;&lt;a href="https://jira.opensciencegrid.org/browse/TECHNOLOGY-2"&gt;https://jira.opensciencegrid.org/browse/TECHNOLOGY-2&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Our current target is to provide a proof-of-concept using Condor. &amp;nbsp;With Condor 7.7.0's cgroup integration, the CPU/memory usage is very accurate, but network accounting for vanilla jobs is missing. 
&amp;nbsp;Network accounting is the topic for this post; we have the following goals:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The accounting should be done for all processes spawned during the batch job.&lt;/li&gt;&lt;li&gt;All network traffic should be included.&lt;/li&gt;&lt;li&gt;Separately account LAN traffic from WAN traffic (in EC2, these have different costs).&lt;/li&gt;&lt;/ul&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;The Woes of Linux Network Accounting&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The state of Linux network accounting, well, sucks (for our purposes!). &amp;nbsp;Here's a few ways to tackle it, and why each of them won't work:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Counting packets&lt;/b&gt; through an interface: If you assume that there is only one job per host, you can count the packets that go through a network interface. &amp;nbsp;This is a big, currently unlikely, assumption.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Per-process accounting&lt;/b&gt;: There exists a kernel patch floating around on the internet that adds per-process in/out statistics. &amp;nbsp;However, other than polling frequently, we have no mechanism to account for short-lived processes. &amp;nbsp;Besides, asking folks to run custom kernels is a good way to get ignored.&lt;/li&gt;&lt;li&gt;&lt;b&gt;cgroups&lt;/b&gt;: There is a net controller in cgroups. &amp;nbsp;This marks packets in such a way that they can be manipulated by the &lt;b&gt;tc&lt;/b&gt; utility. &amp;nbsp;&lt;b&gt;tc&lt;/b&gt; controls the layer of buffering before packets are transferred to the network card and can do accounting. &amp;nbsp;Unfortunately:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;In RHEL6, there's no way to persist &lt;b&gt;tc&lt;/b&gt; rules.&lt;/li&gt;&lt;li&gt;This only accounts for &lt;i&gt;outgoing&lt;/i&gt;&amp;nbsp;packets; incoming packets do not pass through.&lt;/li&gt;&lt;li&gt;We cannot distinguish between local network traffic and off-campus network traffic. 
&amp;nbsp;This can actually be overcome with a technique similar to Berkeley Packet Filters (BPF), but it would be difficult.&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;ptrace&lt;/b&gt;&amp;nbsp;or &lt;b&gt;dynamic loader techniques&lt;/b&gt;: There exist libraries (exemplified by&amp;nbsp;&lt;a href="http://www.cse.nd.edu/~ccl/software/parrot/"&gt;parrot&lt;/a&gt;) that provide a mechanism for intercepting calls. &amp;nbsp;We could instrument this. &amp;nbsp;However, this path is notoriously buggy and difficult to maintain: it would require a lot of code, and would not work for statically-compiled processes.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;The most full-featured network accounting is in the routing code controlled by&amp;nbsp;&lt;i&gt;iptables&lt;/i&gt;. &amp;nbsp;In particular, this can account for incoming and outgoing traffic, plus&amp;nbsp;differentiate&amp;nbsp;between on-campus and off-campus traffic.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We're going to tackle the problem using iptables; the trick is going to be distinguishing all the traffic from a single batch job. &amp;nbsp;As in the previous series on managing batch system processes, we are going to borrow heavily from techniques used in Linux containers.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Per-Batch Job Network Statistics&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To get perfect per-batch-job network statistics that differentiate between local and remote traffic, we will combine iptables, NAT, virtual ethernet devices, and network namespaces. &amp;nbsp;It will be somewhat of a tour-de-force of Linux kernel networking - and currently very manual. 
&amp;nbsp;Automation is still forthcoming.&lt;br /&gt;&lt;br /&gt;This recipe is a synthesis of the ideas presented in the following pages:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Manually setting up networking for a container:&amp;nbsp;&lt;a href="http://lxc.sourceforge.net/index.php/about/kernel-namespaces/network/configuration/"&gt;http://lxc.sourceforge.net/index.php/about/kernel-namespaces/network/configuration/&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Traffic accounting with iptables:&amp;nbsp;&lt;a href="http://www.catonmat.net/blog/traffic-accounting-with-iptables/"&gt;http://www.catonmat.net/blog/traffic-accounting-with-iptables/&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Using a NAT between the "container"&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;We'll be thinking of the batch job as a "semi-container": it will get its own network device like a container, but have more visibility to the OS than in a container. &amp;nbsp;To follow this recipe, you'll need RHEL6 or later.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;First, we'll create a pair of ethernet devices and set up NAT-based routing between them and the rest of the OS. 
&amp;nbsp;We will assume eth0 is the outgoing network device and that the IPs 192.168.0.1 and 192.168.0.2 are currently not routed in the network.&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Enable IP forwarding:&lt;br /&gt;&lt;pre&gt;echo 1 &amp;gt; /proc/sys/net/ipv4/ip_forward&lt;/pre&gt;&lt;/li&gt;&lt;li&gt;Create a veth ethernet device pair:&lt;br /&gt;&lt;pre&gt;ip link add type veth&lt;/pre&gt;This will create two devices, veth0 and veth1, that act similarly to a Unix pipe: bytes sent to veth1 will be received by veth0 (and vice versa).&lt;/li&gt;&lt;li&gt;Assign IPs to the new veth devices; we will use 192.168.0.1 and 192.168.0.2:&lt;br /&gt;&lt;pre&gt;ifconfig veth0 192.168.0.1/24 up&lt;br /&gt;ifconfig veth1 192.168.0.2/24 up&lt;/pre&gt;&lt;/li&gt;&lt;li&gt;Download and compile &lt;a href="https://jira.opensciencegrid.org/secure/attachment/10036/ns_exec.c"&gt;ns_exec.c&lt;/a&gt;; this is a handy utility developed by IBM that allows us to create processes in new namespaces. &amp;nbsp;Compilation can be done like this:&lt;br /&gt;&lt;pre&gt;gcc -o ns_exec ns_exec.c&lt;/pre&gt;This requires a RHEL6 kernel and the kernel headers.&lt;/li&gt;&lt;li&gt;In a separate window, launch a new shell in a new network and mount namespace:&lt;br /&gt;&lt;pre&gt;./ns_exec -nm -- /bin/bash&lt;/pre&gt;We'll refer to this as shell 2 and our original window as shell 1.&lt;/li&gt;&lt;li&gt;Use &lt;b&gt;ps&lt;/b&gt;&amp;nbsp;to determine the pid of shell 2. &amp;nbsp;In shell 1, execute:&lt;br /&gt;&lt;pre&gt;ip link set veth1 netns $PID_OF_SHELL_2&lt;/pre&gt;In shell 2, you should be able to run &lt;b&gt;ifconfig&lt;/b&gt; and see veth1.&lt;/li&gt;&lt;li&gt;In shell 2, re-mount the /sys filesystem and enable the loopback device:&lt;br /&gt;&lt;pre&gt;mount -t sysfs none /sys&lt;br /&gt;ifconfig lo up&lt;/pre&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;At this point, we have a "batch job" (shell 2) with its own dedicated networking device. 
&amp;nbsp;All traffic generated by this process - or its children - must pass through here. &amp;nbsp;Traffic generated in shell 2 will go into veth1 and out veth0. &amp;nbsp;However, we haven't hooked up the routing for veth0, so packets currently stop there; fairly useless.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Next, we create a NAT between veth0 and eth0. &amp;nbsp;This is a point of convergence - alternately, we could bridge the networks at layer 2 or layer 3 and provide the job with its own public IP. &amp;nbsp;I'll leave that as an exercise for the reader. &amp;nbsp;For the NAT, I will assume that 129.93.0.0/16 is the on-campus network and everything else is off-campus. &amp;nbsp;Everything will be done in shell 1:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Verify that any firewall won't be blocking NAT packets. &amp;nbsp;If you don't know how to do that, turn off the firewall with &lt;pre&gt;iptables -F&lt;/pre&gt;. &amp;nbsp;If you want a firewall, but don't know how iptables works, then you probably want to spend a few hours learning first.&lt;/li&gt;&lt;li&gt;Enable the packet mangling for NAT:&lt;br /&gt;&lt;pre&gt;iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE&lt;/pre&gt;&lt;/li&gt;&lt;li&gt;Forward packets from veth0 to eth0, using separate rules for on/off campus:&lt;br /&gt;&lt;pre&gt;iptables -A FORWARD -i veth0 -o eth0 --dst 129.93.0.0/16 -j ACCEPT&lt;br /&gt;iptables -A FORWARD -i veth0 -o eth0 ! --dst 129.93.0.0/16 -j ACCEPT&lt;/pre&gt;&lt;/li&gt;&lt;li&gt;Forward TCP connections from eth0 to veth0 using separate rules:&lt;br /&gt;&lt;pre&gt;iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED --src 129.93.0.0/16 -j ACCEPT&lt;br /&gt;iptables -A FORWARD -i eth0 -o veth0 -m state --state RELATED,ESTABLISHED ! --src 129.93.0.0/16 -j ACCEPT&lt;/pre&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;At this point, you can switch back to shell 2 and verify the network is working. 
&amp;nbsp;iptables will automatically do accounting; you just need to enable command line flags to get it printed:&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: monospace; white-space: pre;"&gt;iptables -L -n -v -x&lt;/span&gt;&lt;br /&gt;If you look at the &lt;a href="http://www.catonmat.net/blog/traffic-accounting-with-iptables/"&gt;network accounting reference&lt;/a&gt;, they show how to separate all the accounting rules into a separate chain. &amp;nbsp;This allows you to, for example, reset counters for only the traffic accounting. &amp;nbsp;On my example host, the output looks like this:&lt;br /&gt;&lt;pre&gt;Chain INPUT (policy ACCEPT 4 packets, 524 bytes)&lt;br /&gt;&amp;nbsp; &amp;nbsp; pkts &amp;nbsp; &amp;nbsp; &amp;nbsp;bytes target &amp;nbsp; &amp;nbsp; prot opt in &amp;nbsp; &amp;nbsp; out &amp;nbsp; &amp;nbsp; source &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; destination &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;br /&gt;&lt;br /&gt;Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)&lt;br /&gt;&amp;nbsp; &amp;nbsp; pkts &amp;nbsp; &amp;nbsp; &amp;nbsp;bytes target &amp;nbsp; &amp;nbsp; prot opt in &amp;nbsp; &amp;nbsp; out &amp;nbsp; &amp;nbsp; source &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; destination &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; 30 &amp;nbsp; &amp;nbsp; 1570 ACCEPT &amp;nbsp; &amp;nbsp; all &amp;nbsp;-- &amp;nbsp;veth0 &amp;nbsp;eth0 &amp;nbsp; &amp;nbsp;0.0.0.0/0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;129.93.0.0/16 &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; 18 &amp;nbsp; &amp;nbsp; 1025 ACCEPT &amp;nbsp; &amp;nbsp; all &amp;nbsp;-- &amp;nbsp;veth0 &amp;nbsp;eth0 &amp;nbsp; &amp;nbsp;0.0.0.0/0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; !129.93.0.0/16 &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; 28 
&amp;nbsp; &amp;nbsp;26759 ACCEPT &amp;nbsp; &amp;nbsp; all &amp;nbsp;-- &amp;nbsp;eth0 &amp;nbsp; veth0 &amp;nbsp; 129.93.0.0/16 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0.0.0.0/0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; state RELATED,ESTABLISHED&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; 17 &amp;nbsp; &amp;nbsp;10573 ACCEPT &amp;nbsp; &amp;nbsp; all &amp;nbsp;-- &amp;nbsp;eth0 &amp;nbsp; veth0 &amp;nbsp;!129.93.0.0/16 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0.0.0.0/0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; state RELATED,ESTABLISHED&lt;br /&gt;&lt;br /&gt;Chain OUTPUT (policy ACCEPT 4 packets, 276 bytes)&lt;br /&gt;&amp;nbsp; &amp;nbsp; pkts &amp;nbsp; &amp;nbsp; &amp;nbsp;bytes target &amp;nbsp; &amp;nbsp; prot opt in &amp;nbsp; &amp;nbsp; out &amp;nbsp; &amp;nbsp; source &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; destination &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;As you can see, my "job" has downloaded about 26KB from on-campus and 10KB from off-campus.&lt;br /&gt;&lt;br /&gt;Voilà! 
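&lt;br /&gt;&lt;br /&gt;For a batch system, the counters need to end up in a job record rather than on a terminal. The following is a minimal Python sketch, not part of the original setup, that totals the byte counters from the FORWARD rows shown above. The function name job_traffic and the four category names are illustrative assumptions, and the sample rows are copied from the listing with whitespace collapsed.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
# FORWARD-chain rows copied from the listing above (whitespace collapsed).
SAMPLE = """\
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
30 1570 ACCEPT all -- veth0 eth0 0.0.0.0/0 129.93.0.0/16
18 1025 ACCEPT all -- veth0 eth0 0.0.0.0/0 !129.93.0.0/16
28 26759 ACCEPT all -- eth0 veth0 129.93.0.0/16 0.0.0.0/0 state RELATED,ESTABLISHED
17 10573 ACCEPT all -- eth0 veth0 !129.93.0.0/16 0.0.0.0/0 state RELATED,ESTABLISHED
"""


def job_traffic(listing, job_if="veth0", campus="129.93.0.0/16"):
    """Sum byte counters per direction and on/off campus for one job interface."""
    totals = {"sent_on": 0, "sent_off": 0, "recv_on": 0, "recv_off": 0}
    for line in listing.splitlines():
        try:
            # Columns: pkts bytes target prot opt in out source destination
            _, nbytes, _, _, _, in_if, out_if, src, dst = line.split()[:9]
            nbytes = int(nbytes)
        except ValueError:
            continue  # chain headers, column titles, and blank lines
        if in_if == job_if:
            direction, peer = "sent", dst   # traffic leaving the job
        elif out_if == job_if:
            direction, peer = "recv", src   # traffic arriving at the job
        else:
            continue
        if peer == campus:
            totals[direction + "_on"] += nbytes
        elif peer == "!" + campus:
            totals[direction + "_off"] += nbytes
    return totals


print(job_traffic(SAMPLE))
```
&lt;/pre&gt;In practice the text would come from running the iptables command above as root; the sketch only handles the column layout shown in this post.&lt;br /&gt;&lt;br /&gt;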
&amp;nbsp;Network accounting appropriate for a batch system!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-5335055704434760709?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/5335055704434760709/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/09/per-batch-job-network-statistics.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/5335055704434760709'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/5335055704434760709'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/09/per-batch-job-network-statistics.html' title='Per-Batch Job Network Statistics'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-1032662574073920066</id><published>2011-08-26T12:50:00.000-07:00</published><updated>2011-08-26T13:09:08.780-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='openstack'/><category scheme='http://www.blogger.com/atom/ns#' term='virtualization'/><category scheme='http://www.blogger.com/atom/ns#' term='hcc'/><title type='text'>Creating a VM for OpenStack</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: large;"&gt;Intro&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here at HCC, we have a 
few VM-based projects going. &amp;nbsp;One is the Condor-based VM launching that Ashu referenced in his previous posting. &amp;nbsp;That project takes an existing capability (Condor batch system hooked to the grid) and extends it; instead of launching processes, one can launch an entire VM.&lt;br /&gt;&lt;br /&gt;One of our other employees, Josh, has been working from the other direction: taking a common "cloud platform", OpenStack, and seeing if it can be adapted to our high-throughput needs. &amp;nbsp;The OpenStack work is in its beginning phases, but bits and pieces are starting to become functional.&lt;br /&gt;&lt;br /&gt;Last night, I tried out the install for the first time. &amp;nbsp;One of the initial tasks I wanted to accomplish was to create a custom VM. &amp;nbsp;A lot of the OpenStack documentation is fairly Ubuntu-specific, so I've taken their pages and adapted them for installing from a CentOS 5.6 machine. &amp;nbsp;Unfortunately, I didn't take any nice screenshots like Ashu did, but I hope this will be useful to others.&lt;br /&gt;&lt;br /&gt;Long term, we plan to open OpenStack up to select OSG VOs for testing.  While we are still in the "tear it down and rebuild once a week" mode, it's just been opened up to select HCC users. 
&lt;br /&gt;&lt;br /&gt;So, without further ado, I present...&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Creating a new Fedora image using HCC's OpenStack&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;These notes are based on the upstream openstack documents here:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://docs.openstack.org/trunk/openstack-compute/admin/content/creating-a-linux-image.html"&gt;http://docs.openstack.org/trunk/openstack-compute/admin/content/creating-a-linux-image.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Prerequisites&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It all starts with an account.&lt;br /&gt;&lt;br /&gt;For local users, contact hcc-support to get your access credentials. &amp;nbsp;They will come in a zipfile. &amp;nbsp;Download the zipfile into your home directory and unpack it. &amp;nbsp;Among other things, there will be a &lt;b&gt;novarc&lt;/b&gt; file. &amp;nbsp;Source this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;source novarc&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This will set up environment variables in your shell pointing to your login credentials.  Do not share these with other people!  You will need to do this each time you open a new shell.&lt;br /&gt;&lt;br /&gt;To create the image, you will need root access on a development machine with KVM installed. &amp;nbsp;I used a CentOS 5.6 machine and did:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;yum groupinstall kvm&lt;/pre&gt;&lt;br /&gt;to get the various necessary KVM packages. &amp;nbsp;I als&lt;br /&gt;&lt;br /&gt;First, create a new raw image file:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;qemu-img create -f raw /tmp/server.img 5G&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This will be the block device that is presented to your virtual machine; make it as large as necessary.  Our current hardware is pretty space-limited: smaller is encouraged.  
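&lt;br /&gt;&lt;br /&gt;A raw image has no header or metadata; it is just a file whose length equals the size of the virtual disk. As an illustration only (a Python sketch with a hypothetical helper name, not a replacement for qemu-img), an equivalent empty image can be made by extending a new file to the target length:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
import os
import tempfile


def create_raw_image(path, size_bytes):
    """Create an empty raw disk image: a file of size_bytes with no data written."""
    with open(path, "wb") as image:
        image.truncate(size_bytes)  # extends the file; most filesystems keep it sparse
    return os.path.getsize(path)


# "5G" in the qemu-img command above means 5 * 1024**3 bytes.
FIVE_GIB = 5 * 1024 ** 3

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        demo = os.path.join(scratch, "demo.img")
        # A small size keeps the demonstration cheap; pass FIVE_GIB for the real thing.
        print(create_raw_image(demo, 1024 ** 2))
```
&lt;/pre&gt;Because nothing is written into the file, the image typically takes almost no space on the host until the guest starts filling it.&lt;br /&gt;&lt;br /&gt;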
Next, download the Fedora boot ISO:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;curl http://serverbeach1.fedoraproject.org/pub/alt/bfo/bfo.iso &amp;gt; /tmp/bfo.iso&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This is a small, 670KB ISO file that contains just enough information to bootstrap the Anaconda installer. &amp;nbsp;Next, we'll boot it as a virtual machine on your local system.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;sudo /usr/libexec/qemu-kvm -m 2048 -cdrom /tmp/bfo.iso -drive file=/tmp/server.img -boot d -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This will create a simple virtual machine (2 cores, 2GB RAM) with &lt;b&gt;/tmp/server.img&lt;/b&gt; as a drive, and boot the machine from /tmp/bfo.iso. &amp;nbsp;It will also allow you to connect to the VM via a VNC viewer.&lt;br /&gt;&lt;br /&gt;If you are physically on the host machine, you can use a VNC viewer for screen ":0". &amp;nbsp;If you are logged in remotely (I log in from my Mac), you'll want to port-forward:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;ssh -L 5900:localhost:5900 username@remotemachine.example.com&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;From your laptop, connect to localhost:0 with a VNC viewer. &amp;nbsp;Note that the most common VNC viewers on the Mac (the built-in Remote Viewer and Chicken of the VNC) don't work with KVM. &amp;nbsp;I found that "JollyFastVNC" works, but costs $5 from the App Store.&lt;br /&gt;&lt;br /&gt;Once logged in, select the version of Fedora you'd like to install, and "click next" until the installation is done. &amp;nbsp;Fedora 15 is sure nice :)&lt;br /&gt;&lt;br /&gt;Fedora will want to reboot the machine, but the reboot will fail because KVM is set to only boot from the CD. 
&amp;nbsp;So, once it tries to reboot, kill KVM and start it again with the following arguments:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/server.img -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Again, connect via VNC, and do any post-install customization. &amp;nbsp;Start by updating and turning on SSH:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;yum update&lt;br /&gt;yum install openssh-server&lt;br /&gt;chkconfig sshd on&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;You will need to tweak &lt;b&gt;/etc/fstab&lt;/b&gt; to make it suitable for a cloud instance. &amp;nbsp;Nova-compute may resize the disk when an instance launches, based on the instance type chosen. This can make the UUID of the disk invalid. &amp;nbsp;Further, we will remove the LVM setup, and just have the root partition present (no swap, no &lt;b&gt;/boot&lt;/b&gt;).&lt;br /&gt;&lt;br /&gt;Inside the VM, edit &lt;b&gt;/etc/fstab&lt;/b&gt;. 
&amp;nbsp;Change the following three lines:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;/dev/mapper/VolGroup-lv_root /                       ext4    defaults        1 1&lt;br /&gt;UUID=0abae194-64c8-4d13-a4c0-6284d9dcd7b4 /boot                   ext4    defaults        1 2&lt;br /&gt;/dev/mapper/VolGroup-lv_swap swap                    swap    defaults        0 0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;to just one line:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;LABEL=uec-rootfs              /          ext4           defaults     0    0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Since Fedora does not ship with an init script for OpenStack, we will do a nasty hack for pulling the correct SSH key at boot.  
Edit the /etc/rc.local file and add the following lines before the line "touch /var/lock/subsys/local":&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;depmod -a&lt;br /&gt;modprobe acpiphp&lt;br /&gt;&lt;br /&gt;# simple attempt to get the user ssh key using the meta-data service&lt;br /&gt;mkdir -p /root/.ssh&lt;br /&gt;echo &amp;gt;&amp;gt; /root/.ssh/authorized_keys&lt;br /&gt;curl -m 10 -s http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key | grep 'ssh-rsa' &amp;gt;&amp;gt; /root/.ssh/authorized_keys&lt;br /&gt;echo "AUTHORIZED_KEYS:"&lt;br /&gt;echo "************************"&lt;br /&gt;cat /root/.ssh/authorized_keys&lt;br /&gt;echo "************************"&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Once you are finished customizing, go ahead and power off:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;poweroff&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Converting to an acceptable OpenStack format&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The image uploaded to OpenStack needs to be an ext4 filesystem image; we currently have a raw block device image. &amp;nbsp;We will extract this filesystem by running a few commands on the host machine. &amp;nbsp;First, we need to find out the starting and ending sectors of the partition. Run:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;fdisk -ul /tmp/server.img&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;You should see output like this (the error messages are harmless):&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;last_lba(): I don't know how to handle files with mode 81a4&lt;br /&gt;You must set cylinders.&lt;br /&gt;You can do this from the extra functions menu.
&lt;br /&gt;&lt;br /&gt;Disk /dev/loop0: 5368 MB, 5368709120 bytes&lt;br /&gt;255 heads, 63 sectors/track, 652 cylinders, total 10485760 sectors&lt;br /&gt;Units = sectors of 1 * 512 = 512 bytes&lt;br /&gt;&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; Device Boot &amp;nbsp; &amp;nbsp; &amp;nbsp;Start &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; End &amp;nbsp; &amp;nbsp; &amp;nbsp;Blocks &amp;nbsp; Id &amp;nbsp;System&lt;br /&gt;/dev/loop0p1 &amp;nbsp; * &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;2048 &amp;nbsp; &amp;nbsp; 1026047 &amp;nbsp; &amp;nbsp; &amp;nbsp;512000 &amp;nbsp; 83 &amp;nbsp;Linux&lt;br /&gt;Partition 1 does not end on cylinder boundary.&lt;br /&gt;/dev/loop0p2 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 1026048 &amp;nbsp; &amp;nbsp;10485759 &amp;nbsp; &amp;nbsp; 4729856 &amp;nbsp; 8e &amp;nbsp;Linux LVM&lt;br /&gt;Partition 2 does not end on cylinder boundary.&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Note the following commands assume the units are 512 bytes. &amp;nbsp;You will need the start and end number for the "Linux LVM"; in this case, it is 1026048 and 10485759.&lt;br /&gt;&lt;br /&gt;Copy the entire partition to a new file&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;dd if=/tmp/server.img of=/tmp/server.lvm.img skip=1026048 count=$((10485759-1026048)) bs=512&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;For "skip" and "count", use the begin and end you copy/pasted from the fdisk output.  Now we have our LVM image; we'll need to activate it. 
&amp;nbsp;First, attach the LVM image to a loopback device and look for the volume group name:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ sudo /sbin/losetup /dev/loop0 /tmp/server.lvm.img&lt;br /&gt;[bbockelm@localhost ~]$ sudo /sbin/pvscan&lt;br /&gt;&amp;nbsp; PV /dev/sdb1 &amp;nbsp; &amp;nbsp;VG vg_home &amp;nbsp; &amp;nbsp; lvm2 [7.20 TB / 0 &amp;nbsp; &amp;nbsp;free]&lt;br /&gt;&amp;nbsp; PV /dev/sda2 &amp;nbsp; &amp;nbsp;VG vg_system &amp;nbsp; lvm2 [73.88 GB / 0 &amp;nbsp; &amp;nbsp;free]&lt;br /&gt;&amp;nbsp; PV /dev/loop0 &amp;nbsp; VG VolGroup &amp;nbsp; &amp;nbsp;lvm2 [4.50 GB / 0 &amp;nbsp; &amp;nbsp;free]&lt;br /&gt;&amp;nbsp; Total: 3 [1.28 TB] / in use: 3 [1.28 TB] / in no VG: 0 [0 &amp;nbsp; ]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Note the third listing is for our loopback device (&lt;b&gt;/dev/loop0&lt;/b&gt;) and a volume group named, simply, "VolGroup". &amp;nbsp;We'll want to activate that:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ sudo /sbin/vgchange -ay VolGroup&lt;br /&gt;&amp;nbsp; 2 logical volume(s) in volume group "VolGroup" now active&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;We can now see the Fedora root file system in &lt;b&gt;/dev/VolGroup/lv_root&lt;/b&gt;.&amp;nbsp; We use dd to make a copy of this logical volume:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;sudo dd if=/dev/VolGroup/lv_root of=/tmp/serverfinal.img&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I get the following output:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ sudo dd if=/dev/VolGroup/lv_root of=/tmp/serverfinal.img&lt;br /&gt;3145728+0 records in&lt;br /&gt;3145728+0 records out&lt;br /&gt;1610612736 bytes (1.6 GB) copied, 14.5444 seconds, 111 MB/s&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;It's time to unmount all our devices. 
&amp;nbsp;Start by removing the LVM:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ sudo /sbin/vgchange -an VolGroup&lt;br /&gt;&amp;nbsp; 0 logical volume(s) in volume group "VolGroup" now active&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Then, unmount our loopback device:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ sudo /sbin/losetup -d /dev/loop0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;We will do one last tweak: change the label on our filesystem image to "&lt;b&gt;uec-rootfs&lt;/b&gt;":&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;sudo /sbin/tune2fs -L uec-rootfs /tmp/serverfinal.img&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;*Note* that your filesystem image is &lt;b&gt;ext4&lt;/b&gt;; if your host is RHEL5.x (this is my case!), your version of tune2fs will not be able to complete this operation. &amp;nbsp;In this case, you will need to restart your VM in KVM with the newly-extracted serverfinal.img as a second hard drive. &amp;nbsp;I did the following KVM invocation:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/server.img -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize -drive file=/tmp/serverfinal.img&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The second drive shows up as &lt;b&gt;/dev/sdb&lt;/b&gt;; go ahead and re-execute &lt;b&gt;tune2fs&lt;/b&gt; from within the VM:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[root@localhost ~]# tune2fs -L uec-rootfs /dev/sdb&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Extract Kernel and Initrd for OpenStack&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Fedora creates a small boot partition separate from the LVM we extracted previously. &amp;nbsp;We'll need to mount it, and copy out the kernel and initrd. 
&amp;nbsp;First, attach the image to a loopback device and map the partitions (this assumes the first free loop device is /dev/loop0; check with losetup -a).&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ sudo /sbin/losetup -f /tmp/server.img&lt;br /&gt;[bbockelm@localhost ~]$ sudo /sbin/kpartx -a /dev/loop0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The boot partition should now be available at &lt;b&gt;/dev/mapper/loop0p1&lt;/b&gt;. &amp;nbsp;Mount this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ sudo mkdir /tmp/server_image/&lt;br /&gt;[bbockelm@localhost ~]$ sudo mount /dev/mapper/loop0p1 /tmp/server_image/&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now, copy out the kernel and initrd:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ cp /tmp/server_image/vmlinuz-2.6.40.3-0.fc15.x86_64 ~&lt;br /&gt;[bbockelm@localhost ~]$ cp /tmp/server_image/initramfs-2.6.40.3-0.fc15.x86_64.img ~&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Unmount and unmap:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ sudo umount /tmp/server_image&lt;br /&gt;[bbockelm@localhost ~]$ sudo /sbin/kpartx -d /dev/loop0&lt;br /&gt;[bbockelm@localhost ~]$ sudo /sbin/losetup -d /dev/loop0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Upload into OpenStack&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We need to bundle, upload, and register the kernel, the initrd, and finally the image. &amp;nbsp;First, the kernel:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@localhost ~]$ euca-bundle-image -i ~/vmlinuz-2.6.40.3-0.fc15.x86_64 --kernel true&lt;br /&gt;Checking image&lt;br /&gt;Encrypting image&lt;br /&gt;Splitting image...
&lt;br /&gt;Part: vmlinuz-2.6.40.3-0.fc15.x86_64.part.00&lt;br /&gt;Generating manifest /tmp/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml&lt;br /&gt;[bbockelm@localhost ~]$ euca-upload-bundle -b testbucket -m /tmp/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml&lt;br /&gt;Checking bucket: testbucket&lt;br /&gt;Uploading manifest file&lt;br /&gt;Uploading part: vmlinuz-2.6.40.3-0.fc15.x86_64.part.00&lt;br /&gt;Uploaded image as testbucket/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml&lt;br /&gt;[bbockelm@localhost ~]$ euca-register testbucket/vmlinuz-2.6.40.3-0.fc15.x86_64.manifest.xml&lt;br /&gt;IMAGE&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;	&lt;/span&gt;aki-0000000a&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Write down the kernel ID; it is &lt;b&gt;aki-0000000a&lt;/b&gt; above.  Then, the initrd:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;euca-bundle-image -i ~/initramfs-2.6.40.3-0.fc15.x86_64.img --ramdisk true&lt;br /&gt;euca-upload-bundle -b testbucket -m /tmp/initramfs-2.6.40.3-0.fc15.x86_64.img.manifest.xml&lt;br /&gt;euca-register testbucket/initramfs-2.6.40.3-0.fc15.x86_64.img.manifest.xml&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;My initrd's ID was &lt;b&gt;ari-0000000b&lt;/b&gt;.  Finally, the disk image itself:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;euca-bundle-image --kernel aki-0000000a --ramdisk ari-0000000b -i /tmp/serverfinal.img -r x86_64&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This will save the bundle into &lt;b&gt;/tmp&lt;/b&gt;, with a manifest named "&lt;b&gt;serverfinal.img.manifest.xml&lt;/b&gt;".  I didn't particularly care for the name, so I changed it to "&lt;b&gt;fedora-15.img.manifest.xml&lt;/b&gt;". &amp;nbsp;Now, upload and register under that name:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;euca-upload-bundle -b testbucket -m /tmp/fedora-15.img.manifest.xml&lt;br /&gt;euca-register testbucket/fedora-15.img.manifest.xml&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Congratulations!  You now have a brand-new Fedora-15 image ready to use.  
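&lt;br /&gt;&lt;br /&gt;The three uploads above follow the same bundle, upload, register pattern, differing only in the flags passed to euca-bundle-image. As a rough Python sketch (the helper name upload_commands is hypothetical, and it assumes the manifest lands in /tmp under the input file's name, as in the transcripts above), the command lines for one file can be assembled like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;
```python
import os


def upload_commands(path, bucket, bundle_flags=""):
    """Build the bundle/upload/register command lines for one file."""
    # Assumption: the manifest is written to /tmp and named after the input file.
    manifest = os.path.basename(path) + ".manifest.xml"
    bundle = "euca-bundle-image -i " + path
    if bundle_flags:
        bundle += " " + bundle_flags
    return [
        bundle,
        "euca-upload-bundle -b {0} -m /tmp/{1}".format(bucket, manifest),
        "euca-register {0}/{1}".format(bucket, manifest),
    ]


# Reproduces the three kernel commands from the transcript above.
for command in upload_commands("~/vmlinuz-2.6.40.3-0.fc15.x86_64",
                               "testbucket", "--kernel true"):
    print(command)
```
&lt;/pre&gt;The sketch only prints the commands; the IDs returned by euca-register for the kernel and initrd still have to be copied into the flags for the final image by hand.&lt;br /&gt;&lt;br /&gt;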
Fire up HybridFox and see if you were successful.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-1032662574073920066?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/1032662574073920066/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/08/creating-vm-for-openstack.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/1032662574073920066'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/1032662574073920066'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/08/creating-vm-for-openstack.html' title='Creating a VM for OpenStack'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-3836501507767218312</id><published>2011-08-18T16:54:00.000-07:00</published><updated>2011-08-18T17:01:14.098-07:00</updated><title type='text'>KVM and Condor (Part 1): Creating the virtual machine.</title><content type='html'>My next topic, which will span two blog posts, is launching a Virtual Machine (VM) in a Condor environment.&amp;nbsp; In the first of these two posts I will share the steps that I took to create a VM that I will launch as a job in Condor. 
&lt;br /&gt;&lt;br /&gt;I will be using the Kernel-based Virtual Machine (KVM) implementation for Linux guests.&amp;nbsp; KVM is a full virtualization framework which can run multiple unmodified guests, including various flavors of Microsoft Windows, Linux operating systems and other UNIX family systems. To see the types of guest operating systems and platforms that KVM supports, you can look at &lt;a href="http://www.linux-kvm.org/page/Guest_Support_Status"&gt;http://www.linux-kvm.org/page/Guest_Support_Status&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Let’s get started. For this blog the host system on which I am working is running CentOS 6.0 with Linux 2.6.32 on an x86_64 platform.&amp;nbsp; I will be creating a CentOS 5.6 image for the VM guest.&amp;nbsp; As the first step, I will get my host system ready with KVM tools and other dependencies. To do this I require a package called kvm; this package includes the KVM kernel module. In addition to the kvm package I will be using three tools (viz. virt-install, virsh, and virt-viewer) from a toolkit called libvirt. Libvirt (&lt;a href="http://libvirt.org/"&gt;http://libvirt.org/&lt;/a&gt;) is a hypervisor-independent API that is able to interact with the virtualization capabilities of various operating systems. 
The commands below show you how to use yum to install kvm and libvirt related packages:&lt;br /&gt;&lt;br /&gt;&lt;div class="csharpcode"&gt;&lt;pre class="alt"&gt;yum install kvm&lt;/pre&gt;&lt;br /&gt;&lt;pre class="alt"&gt;yum install virt-manager libvirt libvirt-python python-virtinst libvirt-client&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;I am now ready to create the VM by using the following command:&lt;br /&gt;&lt;br /&gt;&lt;div class="csharpcode"&gt;&lt;pre class="alt"&gt;virt-install \&lt;/pre&gt;&lt;pre&gt;--name=vm56-25GB \&lt;/pre&gt;&lt;pre class="alt"&gt;--disk path=/home/aguru/myvms/vm5.6-25GB.img,sparse=true,size=25 \&lt;/pre&gt;&lt;pre&gt;--ram=2048 \&lt;/pre&gt;&lt;pre class="alt"&gt;--location=http://mirror.unl.edu/centos/5.6/os/x86_64/ \&lt;/pre&gt;&lt;pre&gt;--os-type=linux  \&lt;/pre&gt;&lt;pre class="alt"&gt;--vnc&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;style type="text/css"&gt; .csharpcode, .csharpcode pre { font-size: small; color: black; font-family: consolas, "Courier New", courier, monospace; background-color: #ffffff; /*white-space: pre;*/ } .csharpcode pre { margin: 0em; } .csharpcode .rem { color: #008000; } .csharpcode .kwrd { color: #0000ff; } .csharpcode .str { color: #006080; } .csharpcode .op { color: #0000c0; } .csharpcode .preproc { color: #cc6633; } .csharpcode .asp { background-color: #ffff00; } .csharpcode .html { color: #800000; } .csharpcode .attr { color: #ff0000; } .csharpcode .alt  { background-color: #f4f4f4; width: 100%; margin: 0em; } .csharpcode .lnum { color: #606060; } &lt;/style&gt;&lt;br /&gt;&lt;br /&gt;In the above code snippet 'virt-install' is a libvirt command line 
tool for provisioning new virtual machines. The options that I have used above are explained below:&lt;br /&gt;--name is the name of the new machine that I am creating&lt;br /&gt;--disk specifies the absolute path of the virtual machine image (file) that will be created. The ‘sparse’ option on the same line means that the host system does not have to allocate all the space up-front, and ‘size’ gives the size of the VM's hard disk drive in GB&lt;br /&gt;--ram is the RAM of the guest in MB&lt;br /&gt;--location gives the network install location where the OS install files for the guest are found&lt;br /&gt;--os-type specifies the type of guest operating system&lt;br /&gt;--vnc sets up a virtual console in the guest and exports it as a VNC server on the host&lt;br /&gt;&lt;br /&gt;Unless there are missing dependencies or tools that did not install correctly, your install should start with a new VNC window popping up on your display. 
I have a few screen captures of what you may see shown below.&lt;br /&gt;&lt;br /&gt;** Just a quick note - to release the mouse cursor from the VNC window you can use Ctrl-Alt.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&lt;a href="http://lh6.ggpht.com/-9wMOqGQ2CiA/Tk1YVtEi25I/AAAAAAAAAMk/QWF77WuG4fM/s1600-h/1%25255B3%25255D.jpg"&gt;&lt;img alt="1" border="0" src="http://lh5.ggpht.com/-751PIAkNgkg/Tk1YVzJ8FtI/AAAAAAAAAMo/A2KtctXZ7mQ/1_thumb%25255B1%25255D.jpg?imgmax=800" style="background-image: none; border-color: -moz-use-text-color; border-style: none; border-width: 0px; display: inline; padding-left: 0px; padding-right: 0px; padding-top: 0px;" title="1" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://lh3.ggpht.com/-raNXC8ymotA/Tk1YWL419ZI/AAAAAAAAAMs/vn3I_0opTUk/s1600-h/2%25255B4%25255D.jpg"&gt;&lt;img alt="2" border="0" src="http://lh4.ggpht.com/-KjxftdblTzs/Tk1YWrl0chI/AAAAAAAAAMw/GFzZ4-A3Qp8/2_thumb%25255B2%25255D.jpg?imgmax=800" style="background-image: none; border-color: -moz-use-text-color; border-style: none; border-width: 0px; display: inline; padding-left: 0px; padding-right: 0px; padding-top: 0px;" title="2" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://lh5.ggpht.com/-oM3zcV5Deik/Tk1YW7cwiEI/AAAAAAAAAM0/47OJ1QAJotg/s1600-h/3%25255B3%25255D.jpg"&gt;&lt;img alt="3" border="0" src="http://lh6.ggpht.com/-VVGXgSrPWo4/Tk1YXXWrfRI/AAAAAAAAAM4/1AhzaL4mAJU/3_thumb%25255B1%25255D.jpg?imgmax=800" style="background-image: none; border-color: -moz-use-text-color; border-style: none; border-width: 0px; display: inline; padding-left: 0px; padding-right: 0px; padding-top: 0px;" title="3" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://lh5.ggpht.com/-0gKgr1NTDsE/Tk1YXg3lYoI/AAAAAAAAAM8/HhmGoLibJ4k/s1600-h/4%25255B3%25255D.jpg"&gt;&lt;img alt="4" border="0" src="http://lh3.ggpht.com/-YtgKpoAoxDE/Tk1YYGT59wI/AAAAAAAAANA/ukVoq386Gwg/4_thumb%25255B1%25255D.jpg?imgmax=800" style="background-image: 
none; border-color: -moz-use-text-color; border-style: none; border-width: 0px; display: inline; padding-left: 0px; padding-right: 0px; padding-top: 0px;" title="4" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://lh3.ggpht.com/-8Qlx61mdXZ4/Tk1YYcPIqvI/AAAAAAAAANE/_B1_SCYSyYY/s1600-h/5%25255B3%25255D.jpg"&gt;&lt;img alt="5" border="0" src="http://lh4.ggpht.com/-6t28N0tW218/Tk1YYpq4ULI/AAAAAAAAANI/gm25iElh-N8/5_thumb%25255B1%25255D.jpg?imgmax=800" style="background-image: none; border-color: -moz-use-text-color; border-style: none; border-width: 0px; display: inline; padding-left: 0px; padding-right: 0px; padding-top: 0px;" title="5" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;and so on with finally a screen as below&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://lh6.ggpht.com/-5gxUHaA0-Uo/Tk1YY1C9K4I/AAAAAAAAANM/LHpCrUxVNL0/s1600-h/14%25255B3%25255D.jpg"&gt;&lt;img alt="14" border="0" src="http://lh6.ggpht.com/-DQx2SMUCfUo/Tk1YZb9PQlI/AAAAAAAAANQ/lBOTCqY-uSg/14_thumb%25255B1%25255D.jpg?imgmax=800" style="background-image: none; border-color: -moz-use-text-color; border-style: none; border-width: 0px; display: inline; padding-left: 0px; padding-right: 0px; padding-top: 0px;" title="14" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;On the final screen of installation you can click the 'Reboot' button from the VM window to restart the guest VM. 
&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;A few basic commands to list, start and stop a VM&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="csharpcode"&gt;&lt;pre class="alt"&gt;virsh list --all&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The output of virsh list --all shows the defined VMs and their current state; for example, 
typical output may look like:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre class="csharpcode"&gt;Id Name                 State&lt;br /&gt;----------------------------------&lt;br /&gt;- vm56-15KSGB          shut off&lt;br /&gt;- vm56-25GB            shut off&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;To start a VM from the shut off state, issue a virsh start command. 
Note below that virsh list --all now shows an Id and the running state for the VM (vm56-15KSGB).&lt;br /&gt;&lt;br /&gt;&lt;pre class="csharpcode"&gt;virsh start vm56-15KSGB&lt;br /&gt;&lt;br /&gt;virsh list --all&lt;br /&gt;Id Name                 State&lt;br /&gt;----------------------------------&lt;br /&gt;1 vm56-15KSGB          running&lt;br /&gt;- vm56-25GB            shut off&lt;/pre&gt;&lt;br /&gt;To display the console of a running VM over VNC, you can use virt-viewer, e.g.&lt;br /&gt;&lt;br /&gt;&lt;pre class="csharpcode"&gt;virt-viewer 1&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;And finally, to shut down a running VM, use virsh shutdown, or force it off with virsh destroy, e.g. 
&lt;br /&gt;&lt;pre class="csharpcode"&gt;virsh shutdown 1&lt;/pre&gt;&lt;b&gt;or&lt;/b&gt;&lt;br /&gt;&lt;pre class="csharpcode"&gt;virsh destroy 1&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;virt-viewer, virsh shutdown, and virsh destroy all take the Id of the running VM as an argument; the VM name is accepted as well.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;What if I have a Kickstart file for the VM I want to create?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If you have a Kickstart file that you would like to use for creating the VM, you can use the following command:&lt;br /&gt;&lt;br /&gt;&lt;div class="csharpcode"&gt;&lt;pre class="alt"&gt;virt-install \&lt;/pre&gt;&lt;pre&gt;--name=vm56-15KSGB \&lt;/pre&gt;&lt;pre class="alt"&gt;--disk path=/home/aguru/myvms/vm56-15KSGB.img,sparse=&lt;span class="kwrd"&gt;true&lt;/span&gt;,size=15 \&lt;/pre&gt;&lt;pre&gt;--ram=2048 \&lt;/pre&gt;&lt;pre class="alt"&gt;--location=http://newman.ultralight.org/os/centos/5.5/x86_64 \&lt;/pre&gt;&lt;pre&gt;--os-type=linux \&lt;/pre&gt;&lt;pre class="alt"&gt;--vnc \&lt;/pre&gt;&lt;pre&gt;-x &lt;span class="str"&gt;"ks=http://httpdserver.hosting.kickstart/pathto.kickstart.file"&lt;/span&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;The only addition to this virt-install command, compared to its earlier use in this post, is the extra flag '-x'. The value passed with the -x flag is the URL of the Kickstart file.&lt;br /&gt;&lt;br /&gt;That is all for this post. In the next post I will talk about using this created image and then launching it in a Condor VM Universe.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-3836501507767218312?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/3836501507767218312/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/08/kernel-based-virtualization-and-condor.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/3836501507767218312'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/3836501507767218312'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/08/kernel-based-virtualization-and-condor.html' title='KVM and Condor (Part 1): Creating the virtual machine.'/><author><name>Ashu Guru</name><uri>http://www.blogger.com/profile/02470446389774568545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' 
height='32' src='http://4.bp.blogspot.com/-VKhZCQt1S5g/TfLVb8SsSBI/AAAAAAAAAK4/mMzuh0xz_zs/s220/image1.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://lh5.ggpht.com/-751PIAkNgkg/Tk1YVzJ8FtI/AAAAAAAAAMo/A2KtctXZ7mQ/s72-c/1_thumb%25255B1%25255D.jpg?imgmax=800' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-7225840585468049745</id><published>2011-07-12T19:23:00.000-07:00</published><updated>2011-07-12T19:23:13.496-07:00</updated><title type='text'>Squid Caching in OSG  Environment</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="containerbody"&gt;A few months back I assisted a research group from the University of Nebraska Medical Center (UNMC) in deploying a search for mass spectrometry-based proteomics analysis. The search used a program called the Open Mass Spectrometry Search Algorithm (OMSSA) and ran on the Open Science Grid (OSG) via a GlideinWMS Frontend. In this post I will talk about the motivation for, and use of, HTTP file transfer with squid caching for the input data and executable files of jobs deployed over the OSG. 
I will also show a basic example explaining the use of Squid in the OSG environment.&lt;br /&gt;&lt;br /&gt;While working with the UNMC research group, and after looking at the OMSSA specifications and documentation, we identified the following computation and data handling requirements for the proteomics analysis:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A total of 45 datasets, each about 21MB.&lt;/li&gt;&lt;li&gt;22,000 comparisons/searches (short jobs) per dataset.&lt;/li&gt;&lt;li&gt;The executables and search libraries for the comparison total 83MB as a compressed archive.&lt;/li&gt;&lt;/ul&gt;Based on the above requirements and a few additional tests, we determined that the workload is well suited to OSG via GlideinWMS. We also decided that each GlideinWMS job would contain about 172 comparisons, which works out to about 5,756 individual jobs (22000*45/172).&lt;br /&gt;&lt;br /&gt;Data in the Open Science Grid has always been more difficult to handle than computation. The challenges grow when either the number of jobs or the data size increases. Various methods are used to overcome and simplify these challenges. Table 1 below shows a rule of thumb that I generally follow to identify the best mode of data transfer for jobs in the OSG environment. Each data transfer method in Table 1 has its own advantages. Condor’s internal file transfer is a built-in method, so no extra scripting is required. SRM can handle large data stores and large data transfers. Pre-staging can distribute the load of pulling down data. &lt;br /&gt;&lt;br /&gt;&lt;table style='border:1px solid #ccc;'&gt;&lt;caption&gt;Table 1. 
Rule of thumb for data transfer using Condor/GlideinWMS jobs in OSG&lt;/caption&gt; &lt;tbody&gt;&lt;tr&gt; &lt;th style='border:1px solid #ccc;'&gt;Data Size&lt;/th&gt;&lt;th style='border:1px solid #ccc;'&gt;Data Transfer Method&lt;/th&gt; &lt;/tr&gt;&lt;tr&gt; &lt;td style='border:1px solid #ccc;'&gt;&amp;lt; 10MB&lt;/td&gt;&lt;td style='border:1px solid #ccc;'&gt;Condor's File Transfer Mechanism&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt; &lt;td style='border:1px solid #ccc;'&gt;10MB - 500MB &lt;/td&gt;&lt;td style='border:1px solid #ccc;'&gt;Storage Element(SE)/Storage Resource Manager(SRM) interface&lt;/td&gt; &lt;/tr&gt;&lt;tr&gt; &lt;td style='border:1px solid #ccc;'&gt;&amp;gt; 500MB&lt;/td&gt;&lt;td style='border:1px solid #ccc;'&gt;SRM/dCache or Pre-staging&lt;/td&gt; &lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;When the number of jobs is large and the data transfer size approaches the upper limit of Condor's internal file transfer, we have found HTTP file transfer to be fairly successful. It lets us move the load of transferring input files and executables away from the GlideinWMS Frontend server. For the proteomics analysis project, the compressed archive of the search library and executables (83MB) was the same across all jobs, and the input data was the same for all jobs within a dataset, so we decided to push our HTTP file transfer approach further by adding squid caching. The advantage of caching grows as more jobs are allocated compute nodes at a site with a local (site-specific) squid server, until we reach the limit of the squid server itself.&lt;br /&gt;&lt;br /&gt;Every CMS and ATLAS site is required to have a squid server, whose location is available via the environment variable OSG_SQUID_LOCATION. 
This implies that a very simple wrapper script on a compute node can pull down input files and/or executables using a client tool such as wget or curl and then proceed with the actual computation. The example below shows a bash script that reads the OSG_SQUID_LOCATION environment variable on a compute node and then tries to download the file via squid; on failure, the script downloads the file directly from the source.&amp;nbsp; (Ref: https://twiki.grid.iu.edu/bin/view/Documentation/OsgHttpBasics)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;#!/bin/bash&lt;br /&gt;website=http://google.com/&lt;br /&gt;&lt;br /&gt;#Section A&lt;br /&gt;source $OSG_GRID/setup.sh&lt;br /&gt;export OSG_SQUID_LOCATION=${OSG_SQUID_LOCATION:-UNAVAILABLE}&lt;br /&gt;if [ "$OSG_SQUID_LOCATION" != UNAVAILABLE ]; then&lt;br /&gt;  export http_proxy=$OSG_SQUID_LOCATION&lt;br /&gt;fi&lt;br /&gt;&lt;br /&gt;#Section B&lt;br /&gt;wget --retry-connrefused --waitretry=20 $website&lt;br /&gt;&lt;br /&gt;#Section C &lt;br /&gt;#Check if the download worked&lt;br /&gt;if [ $? -ne 0 ]&lt;br /&gt;then&lt;br /&gt;   unset http_proxy&lt;br /&gt;   wget --retry-connrefused --waitretry=20 $website&lt;br /&gt;   if [ $? -ne 0 ]&lt;br /&gt;   then&lt;br /&gt;      exit 1&lt;br /&gt;   fi&lt;br /&gt;fi&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;An explanation of the above code:&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Section A: Check whether the environment variable OSG_SQUID_LOCATION is set; if so, export its value as the environment variable http_proxy, which wget uses to locate the squid server.&lt;/li&gt;&lt;li&gt;Section B: Download the file using wget. The flag --retry-connrefused treats a refused connection as a transient error and tries again, which helps to handle short-term failures. 
The --waitretry flag sets the maximum wait between retries to 20 seconds (wget backs off linearly up to that limit).&lt;/li&gt;&lt;li&gt;Section C: If the download through the squid server fails, unset http_proxy and fetch the file directly from the HTTP source.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In addition to an OSG site-specific squid server, this type of data transfer requires a reliable HTTP server that can handle download requests from sites where the squid server is unavailable. The HTTP server must also be able to handle the requests originating from the squid servers, along with any failover requests. At UNL we have set up a dedicated HTTP serving infrastructure with load-balanced failover. It is implemented using the Linux Virtual Server; the implementation details are shown in the diagram below.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-OTA4BpWAxwI/ThyNNxvfHDI/AAAAAAAAAL8/jV-EG0BkI98/s1600/LVSsetup.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://4.bp.blogspot.com/-OTA4BpWAxwI/ThyNNxvfHDI/AAAAAAAAAL8/jV-EG0BkI98/s320/LVSsetup.jpg" width="302" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;You can see more detailed examples of squid usage at &lt;a href="https://twiki.grid.iu.edu/bin/view/Documentation/OsgHttpBasics"&gt;https://twiki.grid.iu.edu/bin/view/Documentation/OsgHttpBasics&lt;/a&gt;&lt;br /&gt;There is also an excellent presentation by Derek Weitzel available at &lt;a 
href="http://docs.google.com/viewer?url=https%3A%2F%2Ftwiki.grid.iu.edu%2Ftwiki%2Fpub%2FCampusGrids%2FApr27%252c2011%2FCampusGridSquid.pdf&amp;amp;embedded=true"&gt;http://docs.google.com/viewer?url=https%3A%2F%2Ftwiki.grid.iu.edu%2Ftwiki%2Fpub%2FCampusGrids%2FApr27%252c2011%2FCampusGridSquid.pdf&amp;amp;embedded=true&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-7225840585468049745?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/7225840585468049745/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/07/squid-caching-in-osg-environment.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/7225840585468049745'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/7225840585468049745'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/07/squid-caching-in-osg-environment.html' title='Squid Caching in OSG  Environment'/><author><name>Ashu Guru</name><uri>http://www.blogger.com/profile/02470446389774568545</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://4.bp.blogspot.com/-VKhZCQt1S5g/TfLVb8SsSBI/AAAAAAAAAK4/mMzuh0xz_zs/s220/image1.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-OTA4BpWAxwI/ThyNNxvfHDI/AAAAAAAAAL8/jV-EG0BkI98/s72-c/LVSsetup.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-1891031173632648876</id><published>2011-07-08T16:09:00.000-07:00</published><updated>2011-07-08T16:09:37.217-07:00</updated><title type='text'>Part III: Bulletproof process tracking with cgroups</title><content type='html'>Finally, it's time to provide a &lt;b&gt;good&lt;/b&gt; solution for accomplishing process tracking in a Linux batch system.&lt;br /&gt;If you recall, in &lt;a href="http://osgtech.blogspot.com/2011/06/how-your-batch-system-watches-your.html"&gt;Part I&lt;/a&gt; we surveyed common methods for process tracking and ultimately concluded that batch systems used userspace mechanisms (most of which were originally designed for shell-based process control, by the way) that were unreliable or couldn't detect when failures occurred.&amp;nbsp; In &lt;a href="http://osgtech.blogspot.com/2011/06/part-ii-keeping-mindful-eye-on-your.html"&gt;Part II&lt;/a&gt;, the picture brightened: the kernel provided an event feed about process births and deaths, and informed us when messages were dropped.&lt;br /&gt;&lt;br /&gt;In this post, we'll talk about a new feature called "cgroups", short for "control groups".&amp;nbsp; Cgroups are a mechanism in the Linux kernel for managing a set of processes and all their descendants.&amp;nbsp; They are managed through a filesystem-like interface (in the manner of /proc); the directory structure expresses the fact that they are hierarchical, and filesystem permissions can be used to restrict the set of users allowed to manipulate them.&amp;nbsp; By default, only root is allowed to manipulate control groups: unlike the process groups, process trees, and environment cookies examined before, a process typically has no ability to change its group.&amp;nbsp; Further, unlike the proc connector API, the control group is assigned synchronously by the kernel at process creation time.&amp;nbsp; Hence, fork-bombs are not an effective way to 
escape from the group.&lt;br /&gt;&lt;br /&gt;While having the tracking done by the kernel is an immense improvement, the true power of cgroups becomes apparent through the use of multiple subsystems.&amp;nbsp; Different cgroup subsystems may act to control scheduler policy, allocate or limit resources, or account for usage.&lt;br /&gt;&lt;br /&gt;For example, the &lt;i&gt;memory&lt;/i&gt; controller can be used to limit the amount of memory used by a set of processes.&amp;nbsp; This is a huge improvement over the previous memory limit technique (rlimit), where the limit was assigned per-process.&amp;nbsp; With rlimit, you could limit a single process to 1GB, but the job could just spawn N processes of 1GB each, sidestepping your limits.&amp;nbsp; In the kernel shipped with Fedora 15, 10 controllers are active by default.&amp;nbsp; For more information, you can check the documentation:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Cgroups"&gt;http://en.wikipedia.org/wiki/Cgroups&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt"&gt;http://www.kernel.org/doc/Documentation/cgroups/cgroups.txt&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;If you are a Red Hat customer, I find the RHEL6 manual has the best cgroups documentation out there.&lt;br /&gt;&lt;br /&gt;To see cgroups in action, use the &lt;a href="http://0pointer.de/public/systemd-man/systemd-cgls.html"&gt;systemd-cgls&lt;/a&gt; command found on Fedora 15.&amp;nbsp; This will print out the current hierarchy of all cgroups.&amp;nbsp; Here's what I see on my system (output truncated for display reasons):&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;├ condor&lt;br /&gt;│ ├ 17948 /usr/sbin/condor_master -f&lt;br /&gt;│ ├ 17949 condor_collector -f&lt;br /&gt;│ ├ 17950 condor_negotiator -f&lt;br /&gt;│ ├ 17951 condor_schedd -f&lt;br /&gt;│ ├ 17952 condor_startd -f&lt;br /&gt;│ ├ 17953 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 48...&lt;br 
/&gt;│ └ 18224 condor_procd -A /var/run/condor/procd_pipe.STARTD -R 10000000 -S 60 -C 48...&lt;br /&gt;├ user&lt;br /&gt;│ ├ root&lt;br /&gt;│ │ └ master&lt;br /&gt;│ │&amp;nbsp;&amp;nbsp; └ 6879 bash&lt;br /&gt;│ └ bbockelm&lt;br /&gt;│&amp;nbsp;&amp;nbsp; ├ 1168&lt;br /&gt;│&amp;nbsp;&amp;nbsp; │ ├ 21426 sshd: bbockelm [priv]&lt;br /&gt;│&amp;nbsp;&amp;nbsp; │ ├ 21429 sshd: bbockelm@pts/3&lt;br /&gt;│&amp;nbsp;&amp;nbsp; │ ├ 21430 -bash&lt;br /&gt;│&amp;nbsp;&amp;nbsp; │ └ 21530 systemd-cgls&lt;br /&gt;│&amp;nbsp;&amp;nbsp; ├ 309&lt;br /&gt;│&amp;nbsp;&amp;nbsp; │ ├&amp;nbsp; 1110 /usr/libexec/gvfsd-http --spawner :1.4 /org/gtk/gvfs/exec_spaw/0&lt;br /&gt;│&amp;nbsp;&amp;nbsp; │ ├&amp;nbsp; 6198 gnome-terminal&lt;br /&gt;│&amp;nbsp;&amp;nbsp; │ ├&amp;nbsp; 6202 gnome-pty-helper&amp;nbsp;&lt;/pre&gt;&lt;pre&gt;(output trimmed) &lt;/pre&gt;&lt;pre&gt;└ system&lt;br /&gt;&amp;nbsp; ├ 1 /bin/systemd --log-level info --log-target syslog-or-kmsg --system --dump...&lt;br /&gt;&amp;nbsp; ├ sendmail.service&lt;br /&gt;&amp;nbsp; │ ├ 8603 sendmail: accepting connections&lt;br /&gt;&amp;nbsp; │ └ 8612 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue&lt;br /&gt;&amp;nbsp; ├ auditd.service&lt;br /&gt;&amp;nbsp; │ ├ 8542 auditd&lt;br /&gt;&amp;nbsp; │ ├ 8544 /sbin/audispd&lt;br /&gt;&amp;nbsp; │ └ 8552 /usr/sbin/sedispatch&lt;br /&gt;&amp;nbsp; ├ sshd.service&lt;br /&gt;&amp;nbsp; │ └ 7572 /usr/sbin/sshd&amp;nbsp;&lt;/pre&gt;&lt;pre&gt;(output trimmed)&lt;/pre&gt;&lt;br /&gt;All of the processes in my system are in the / cgroup; all login shells are placed inside a cgroup named /user/$USERNAME; each system service (such as ssh) is located inside a cgroup named /system/$SERVICENAME; finally, there's a special one named /condor.&amp;nbsp; More on /condor later.&lt;br /&gt;&lt;br /&gt;To see the cgroups for the current 
process, you can do the following:&lt;br /&gt;&lt;pre&gt;[bbockelm@mydesktop ~]$ cat /proc/self/cgroup &lt;br /&gt;10:blkio:/&lt;br /&gt;9:net_cls:/&lt;br /&gt;8:freezer:/&lt;br /&gt;7:devices:/&lt;br /&gt;6:memory:/&lt;br /&gt;5:cpuacct:/&lt;br /&gt;4:cpu:/&lt;br /&gt;3:ns:/&lt;br /&gt;2:cpuset:/&lt;br /&gt;1:name=systemd:/user/bbockelm/1168&lt;br /&gt;&lt;/pre&gt;Note that a process is not limited to a single cgroup.  The rules are that a process has one cgroup per mount, each mount has one or more controllers, and a controller can only be mounted once.&lt;br /&gt;&lt;br /&gt;Each controller exposes statistics through the cgroup filesystem.&amp;nbsp; For example, on Fedora 15, if I want to see how much memory the processes in the /condor cgroup are using, I can do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;[bbockelm@rcf-bockelman ~]$ cat /cgroups/memory/condor/memory.usage_in_bytes &lt;br /&gt;34365440&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;But what about the batch system?&lt;/span&gt;&lt;br /&gt;I hope our readers can see the immediate utility in having a simple mechanism for inescapable process tracking.&amp;nbsp; We examined one such mechanism before (adding a secondary GID per batch job), but it has a small drawback in that the secondary GID can be used to create permanent objects (files owned by the secondary GID) which outlive the lifetime of the batch job.&lt;br /&gt;&lt;br /&gt;But, even in Part I of the series, we concluded that a perfect process tracking mechanism is not enough: we also need to be able to kill processes when the batch job is finished!&amp;nbsp; The cgroups developers must have come to the same conclusion, as one controller is called the &lt;i&gt;freezer&lt;/i&gt;.&amp;nbsp; The freezer controller simply stops any process in the cgroup from receiving CPU time from the kernel.&amp;nbsp; All processes in the cgroup are frozen - and there is no way for a process to know it is about to freeze, as they aren't informed via 
signals.&amp;nbsp; Hence, a process tracker can freeze the processes, send them all SIGKILL, and unfreeze them.&amp;nbsp; All processes will end immediately; none will have the ability to hide in the /proc system or spawn new children in a race condition.&lt;br /&gt;&lt;br /&gt;If you look at the first process tree posted, there is a cgroup called "condor".&amp;nbsp; As I presented at &lt;a href="http://www.cs.wisc.edu/condor/CondorWeek2011/presentations/bockelman-user-isolation.pdf"&gt;Condor Week 2011&lt;/a&gt;, Condor is now integrated with cgroups.&amp;nbsp; It can be started in a cgroup the sysadmin specifies (such as /condor), and it will create a unique cgroup for each job (/cgroup/job_$CLUSTERID_$PROC_ID).&amp;nbsp; It uses whatever controllers are active on the system to try and track memory consumption, CPU time, and block I/O.&amp;nbsp; When the job ends or is killed, the freezer controller is used to clean up any processes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Conclusions&lt;/span&gt;&lt;br /&gt;As the disparate scientific clusters have become increasingly linked through the use of grids, improved process tracking has become more important.&amp;nbsp; Many sites have users from across the nation; it's no longer possible for a sysadmin to be good friends with each user.&amp;nbsp; Some users have jobs of questionable quality; some have virus-ridden laptops.&lt;br /&gt;&lt;br /&gt;In the end, traditional process tracking in batch systems is not really ready for modern users.&amp;nbsp; Most modern batch systems no longer rely solely on the original Unix grouping mechanisms, but they will still fall to malicious users.&amp;nbsp; The problem is not solvable only from user space.&lt;br /&gt;&lt;br /&gt;Luckily, with the proc connector API (for any Linux 2.6 kernel) and cgroups (for recent kernels), we can greatly improve the state of the art.&amp;nbsp; The group of folks contributing to the Linux kernel is broad, but I understand much of the 
contribution to cgroups has come from the OpenVZ folks: thanks guys!&lt;br /&gt;&lt;br /&gt;As I've been exploring this subject, I have been implementing cgroup usage in Condor: I think it's a great new feature.&amp;nbsp; It will be released with Condor 7.7.0, due in a few days.&amp;nbsp; There's no reason other batch systems can't also adopt cgroups for process tracking: I hope they spread widely in the future!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-1891031173632648876?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/1891031173632648876/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/07/part-iii-bulletproof-process-tracking.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/1891031173632648876'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/1891031173632648876'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/07/part-iii-bulletproof-process-tracking.html' title='Part III: Bulletproof process tracking with cgroups'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-279082624499073271</id><published>2011-06-24T08:32:00.000-07:00</published><updated>2011-06-24T08:32:35.458-07:00</updated><title type='text'>Part 
II: Keeping a mindful eye on your users with ProcPolice.</title><content type='html'>&lt;a href="http://osgtech.blogspot.com/2011/06/how-your-batch-system-watches-your.html"&gt;In Part I of this series&lt;/a&gt;, we talked about the various mechanisms a batch system uses to track your job's processes, and concluded the state of the art isn't particularly impressive.&amp;nbsp; The only way to go is up; this post discusses an improved technique for process tracking in Linux.&amp;nbsp; It was motivated by &lt;a href="http://netsplit.com/2011/02/09/the-proc-connector-and-socket-filters/"&gt;this blog post&lt;/a&gt; from the author of Upstart.&amp;nbsp; If you feel inspired and would like to read some code, it is highly recommended reading.&lt;br /&gt;&lt;br /&gt;The previous post went from bad to dire: most batch systems use methods that are easily defeated by the job changing its runtime environment (altering the process group, reparenting to init, or changing the environment).  Even when using a reliable tracking method, killing an arbitrary set of processes is not possible.&lt;br /&gt;&lt;br /&gt;To top it off, when process tracking or killing goes awry, we have no reliable means to detect that our methods have failed.&lt;br /&gt;&lt;br /&gt;There's a small, relatively unknown corner of the Linux kernel that can help us out: the proc connector.  A privileged process can connect a socket to the kernel and receive a stream of messages about processes on the system.  
Any time one of the following system calls happens:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;fork/clone&lt;/li&gt;&lt;li&gt;exec&lt;/li&gt;&lt;li&gt;exit&lt;/li&gt;&lt;li&gt;setuid&lt;/li&gt;&lt;li&gt;setgid&lt;/li&gt;&lt;li&gt;setsid&lt;/li&gt;&lt;/ul&gt;for a thread or a process (all the events are documented in linux/cn_proc.h in the kernel's sources), the socket receives a message containing all the relevant event details.&amp;nbsp; By tracking only the fork and exit events, one can build a process tree in memory, starting with the batch system worker process.&lt;br /&gt;&lt;br /&gt;Because it is based on events from the kernel, not periodic polling of /proc from user space, this is a far more reliable method for tracking processes.&amp;nbsp; With a little help from the kernel, the picture is already brighter!&lt;br /&gt;&lt;br /&gt;The drawback here is that, while a message is being processed in user space, further messages to the socket are buffered in memory.&amp;nbsp; When the buffer is full, the kernel drops any further messages: the tracker will lose possibly important events.&amp;nbsp; The event stream is asynchronous: the fork or exit occurs regardless of whether you process the associated message.&amp;nbsp; Unless the tracking code is particularly slow, the &lt;i&gt;only&lt;/i&gt; case where the buffer overruns is likely the very case we care about: someone launching a fork bomb to escape the system.&lt;br /&gt;&lt;br /&gt;If you have too many messages, the first step is to receive fewer messages.&amp;nbsp; One can hand the kernel a small program in a special assembly language that pre-filters messages: a message that isn't put into the queue can't overflow it!&amp;nbsp; Writing these filters is a fun academic exercise, but not useful: when a fork bomb occurs, the messages the tracker needs to receive are precisely the ones overflowing the buffer!&lt;br /&gt;&lt;br /&gt;So while never failing is preferable, detecting when we have failed is 
acceptable: when the buffer is full, the next attempt to read a message from the socket will return an ENOBUFS error to indicate the buffer has overflowed.&amp;nbsp; Actions can be taken: for most batch systems, a nasty email to the user and sysadmin might be sufficient.&amp;nbsp; If you work for the NSA, perhaps the appropriate response is to power down the worker node and send out black helicopters.&lt;br /&gt;&lt;br /&gt;I've taken the approach outlined here and turned it into a small package called "ProcPolice".&amp;nbsp; It consists of a simple daemon which listens to the stream of events, and adds the process to an in-memory tree if the process can be tracked back to a batch system job.&amp;nbsp; ProcPolice will detect when a process reparents to init and, if it is launched from the batch system and non-root, it can log the event or kill the newly-daemonized process.&amp;nbsp; In testing, it is able to stop simple fork bombs and detect more sophisticated ones.&lt;br /&gt;&lt;br /&gt;As ProcPolice runs as a separate daemon that watches the batch system and intervenes only on daemonized processes, it can be used immediately with any batch system (Condor and PBS have been tested).&amp;nbsp; ProcPolice is available in source code form from&lt;br /&gt;&lt;blockquote&gt;svn://t2.unl.edu/brian/proc_police&lt;/blockquote&gt;or as a &lt;a href="http://t2.unl.edu/store/repos/nebraska/5/nebraska/x86_64/proc_police-0.0.3-1.x86_64.rpm"&gt;RHEL5-compatible RPM&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;ProcPolice was invented with a few specific requirements in mind:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Prevent batch system-based processes from outliving the lifetime of the batch system job without changing the runtime of the job itself.&lt;/li&gt;&lt;li&gt;Do this without support in the batch system itself.&lt;/li&gt;&lt;li&gt;Detect when failures occur.&lt;/li&gt;&lt;li&gt;Support RHEL5 (the OS used by the LHC for the next few years).&lt;/li&gt;&lt;/ol&gt;It turns out the last requirement is 
perhaps the most stringent one; newer kernels have a specific feature for tracking and controlling arbitrary sets of processes.&amp;nbsp; This is the topic of the next part of this series.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-279082624499073271?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/279082624499073271/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/06/part-ii-keeping-mindful-eye-on-your.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/279082624499073271'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/279082624499073271'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/06/part-ii-keeping-mindful-eye-on-your.html' title='Part II: Keeping a mindful eye on your users with ProcPolice.'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-6501910750059634875</id><published>2011-06-15T12:39:00.000-07:00</published><updated>2011-06-15T12:39:59.715-07:00</updated><title type='text'>Future Computing in Particle Physics Workshop</title><content type='html'>(Taking a break from working on the next post in the series; it's about half-done, expect it before I head out on vacation this week.&amp;nbsp; For now, I'll 
make a note about life in the field.)&lt;br /&gt;&lt;br /&gt;I've been invited to talk at the Future Computing in Particle Physics Workshop in Edinburgh, which has the following abstract:&lt;br /&gt;&lt;blockquote&gt;Recent developments in computing and software architectures have resulted in huge potential for accelerating applications used in experimental particle physics. This is an ideal time to investigate how a significant performance boost can be achieved by the effective use of many-core and GPU architectures in a distributed computing environment, as well as utilising emerging I/O and storage technologies. This workshop aims to discuss what has been done so far in the field and what potential future development areas are feasible.&lt;br /&gt;&lt;a href="https://indico.cern.ch/conferenceDisplay.py?confId=141309"&gt;https://indico.cern.ch/conferenceDisplay.py?confId=141309&lt;/a&gt;&lt;/blockquote&gt;It's an exciting workshop; the downside is that it started today and I'm on the wrong side of the Atlantic!&amp;nbsp; Thus, I have the pleasure of attending via videoconference.&amp;nbsp; While it doesn't truly replace attending a conference - we all swear half the value of these conferences is the discussions that occur during breaks - there are a few things I've learned:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;No matter how early your presentation is in your timezone, show up early and ask questions on other presentations.&amp;nbsp; Besides the obvious good etiquette (if you don't plan on paying attention, decline the invitation), this allows you to test out the quality of the videoconference setup.&lt;/li&gt;&lt;li&gt;Find a friend sitting in the remote audience to IM you during the presentation.&amp;nbsp; When you're physically there, you can gauge interest levels from the audience's body language.&amp;nbsp; Are they bored?&amp;nbsp; Can they hear/see you?&amp;nbsp; Having a spy in the audience helps you get this feedback.&lt;/li&gt;&lt;li&gt;My father always says "I 
can hire a monkey to stand up and read off PowerPoint slides.&amp;nbsp; They are here to hear you present".&amp;nbsp; The adage is still partially true, but a larger-than-normal part of the information conveyed to the audience is going to go through these slides.&amp;nbsp; Spend some extra time on them.&lt;/li&gt;&lt;/ul&gt;Unfortunately, while the presentations and audio were excellent, the "Whisky Tasting Welcome Reception" doesn't translate well to videoconferencing.&lt;br /&gt;&lt;br /&gt;Now, onto the subject of the workshop: the future of computing in particle physics.&amp;nbsp; I'll be talking about I/O.&amp;nbsp; Really, it all boils down to two things:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;There is no magic bullet to make I/O faster.&amp;nbsp; From what I can tell, the limitation is the complexity of our data structures.&amp;nbsp; Improvements to the current I/O stack - or a new I/O stack - aren't likely to turn bad data structures into good ones. &lt;/li&gt;&lt;li&gt;We demand remote I/O!&amp;nbsp; Having batch system access to the wealth of data is great... 
but it's time to have the ability to do remote I/O also.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-6501910750059634875?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/6501910750059634875/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/06/future-computing-in-particle-physics.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/6501910750059634875'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/6501910750059634875'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/06/future-computing-in-particle-physics.html' title='Future Computing in Particle Physics Workshop'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-9119531119030852746</id><published>2011-06-13T08:35:00.000-07:00</published><updated>2011-06-13T18:33:48.994-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><category scheme='http://www.blogger.com/atom/ns#' term='Condor'/><category scheme='http://www.blogger.com/atom/ns#' term='batch system'/><category scheme='http://www.blogger.com/atom/ns#' term='kernel'/><category scheme='http://www.blogger.com/atom/ns#' term='accounting'/><title type='text'>Part I: How 
your batch system watches your processes (and why it's so bad at it)</title><content type='html'>&lt;span style="font-size: large;"&gt;Series Preamble &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Almost every cluster sysadmin has faced a case of "users gone wild"; for us, it's almost always due to users abusing the shared file system or user processes escaping the watchful eye of the batch system.&amp;nbsp; If I could prevent abuse of the shared file system while keeping it functional, I'd be a rich man.&amp;nbsp; I'm not a rich man, so I'm going to be talking about the latter issue.&amp;nbsp; This is a big topic, so I'm going to be splitting it up into a few posts:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Part I: How your batch system watches your processes (and why it's so bad at it).&lt;/li&gt;&lt;li&gt;Part II: Keeping a mindful eye on your users with ProcPolice.&lt;/li&gt;&lt;li&gt;Part III: Death of the fork-bomb: Ironclad process tracking in batch systems.&lt;/li&gt;&lt;/ul&gt;A few caveats up-front: I'm going to be talking about the platform I know (Linux-based OS's) and the batch systems we use (Condor, PBS, and a bit of SGE).&amp;nbsp; Apologies to the Windows/obscure-Unix-variant/LSF users out there.&lt;br /&gt;&lt;br /&gt;So, onward and upward!&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Strategy: Process Groups&lt;/span&gt;&lt;br /&gt;Each process on the system belongs to a process group, and the process groups are further grouped into a session (as in, a login session).&amp;nbsp; Most batch systems, when starting a job, will start the job in a new session and a fresh process group.&amp;nbsp; Process groups are at their most useful when sending signals: the batch system can send a signal (such as SIGKILL to terminate processes) to a process group.&amp;nbsp; The kernel does the process tracking and appropriately signals all the processes in a group.&lt;br /&gt;&lt;br /&gt;If this worked well, it would be a short blog series.&amp;nbsp; Unfortunately, any 
process can start a new process group - removing it from its previous group.&amp;nbsp; To hide from the batch system, one can simply daemonize and start a new process group.&amp;nbsp; Voilà!&amp;nbsp; You've now escaped. &lt;br /&gt;&lt;br /&gt;Process groups are used by all batch systems; despite not working well, they are the most straightforward mechanism.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Strategy: Process Trees&lt;/span&gt;&lt;br /&gt;Another strategy is "process trees".&amp;nbsp; When a process is created, it has both a process ID and parent ID.&amp;nbsp; By reading from the /proc filesystem (side note: Mac OS X has &lt;a href="http://developer.apple.com/library/mac/#qa/qa2001/qa1123.html"&gt;a far superior way&lt;/a&gt; of getting a programmatic snapshot of all system processes.&amp;nbsp; Jealous.), the batch system can create a list of all processes on the system and their parents, and build an in-memory tree of the parent-child relationships.&amp;nbsp; By starting with the process the batch system launched for the job, and walking through the tree, the batch system can determine all the processes associated with a job.&amp;nbsp; Creating and updating a process tree is poll-based: the operation is performed every X seconds (a typical value might be X=5).&amp;nbsp; As long as the lifetime of all processes is X seconds or more, the batch system will see all processes. 
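&lt;br /&gt;&lt;br /&gt;As a rough illustration of the polling approach described above, here is a minimal Python sketch.&amp;nbsp; It is not Condor's actual code: the function names read_parent_pids and descendants are made up for this example, and it assumes a Linux /proc layout where the parent PID is the second field after the parenthesised command name in /proc/PID/stat.&lt;br /&gt;&lt;pre&gt;

```python
import os
from collections import defaultdict


def read_parent_pids(proc_root="/proc"):
    """Return {pid: ppid} for every process visible under proc_root (Linux only)."""
    parents = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, name, "stat")) as handle:
                data = handle.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        # The command name is parenthesised and may contain spaces, so split
        # on the last ")"; the fields after it are: state, ppid, ...
        fields = data.rsplit(")", 1)[1].split()
        parents[int(name)] = int(fields[1])
    return parents


def descendants(parents, root_pid):
    """Walk the parent-to-children tree and return every descendant of root_pid."""
    children = defaultdict(list)
    for pid, ppid in parents.items():
        children[ppid].append(pid)
    found, stack = set(), [root_pid]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found
```

&lt;/pre&gt;Calling descendants(read_parent_pids(), job_pid) once per polling interval, where job_pid is a placeholder for the PID the batch system launched, yields the set of processes currently attributed to the job.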
&lt;br /&gt;&lt;br /&gt;Well-behaved batch jobs stay within the process tree started by the batch system; this way, they get monitored by the batch system and their resource usage gets accounted for.&lt;br /&gt;&lt;br /&gt;The issue here, again, is badly behaved processes.&amp;nbsp; When a process's parent exits, the child is assigned a new parent: init, or PID 1.&amp;nbsp; To escape notice of a batch system that uses process trees, one needs to fork twice (creating a child and "grandchild"), then have the immediate child die.&amp;nbsp; The grandchild is re-parented to init, as if it were a system daemon.&amp;nbsp; Done quickly enough, the grandchild escapes the batch system.&lt;br /&gt;&lt;br /&gt;The process tree strategy is used by Condor.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Strategy: Environment Cookies&lt;/span&gt;&lt;br /&gt;A process's Unix environment is automatically inherited by its children, and remains unchanged if the parent exits.&amp;nbsp; Condor currently takes advantage of these facts and inserts an extra environment variable into each batch system job.&amp;nbsp; If you dump the environment of your current job using "env", you might see something like this:&lt;br /&gt;&lt;br /&gt;&lt;blockquote style="font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;"&gt;_CONDOR_ANCESTOR_17948=17952:1307975354:2631244213&lt;br /&gt;_CONDOR_ANCESTOR_17952=18260:1307976308:2791283533&lt;br /&gt;_CONDOR_ANCESTOR_18260=18263:1307976309:1204008886&lt;/blockquote&gt;&lt;br /&gt;Each of these is an environment variable used by Condor to track the process's ancestry.&amp;nbsp; In this case, the condor_starter's PID is 18260 and the job's PID is 18263 (the other entries are from parents of the condor_starter process, the condor_startd and condor_master).&amp;nbsp; Any sub-process started by the job will retain the _CONDOR_ANCESTOR_18260 variable by default.&lt;br /&gt;&lt;br /&gt;When Condor polls the /proc filesystem to 
build a process tree, it can also read out the environment variables and use them to place processes in the tree.&amp;nbsp; As before, this relies on the user being friendly: if a job changes the environment variables, its processes can again escape the batch system.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Strategy: Supplementary Group IDs&lt;/span&gt;&lt;br /&gt;Notice that all strategies so far involve some property of the process which is automatically inherited by its children (the process group, the process ancestry, or the Unix environment variables), but can be changed by the user's job.&lt;br /&gt;&lt;br /&gt;A property inherited by subprocesses that cannot be changed without special privilege is the set of group IDs.&amp;nbsp; Each process has a set of group IDs associated with it (if you look at the contents of /proc/self/status, you can see the groups associated with your terminal); it requires administrator privileges to add or remove group IDs, which the batch system has but the user does not.&lt;br /&gt;&lt;br /&gt;Condor and SGE can be assigned a range of group IDs to hand out, and assign one of the IDs to the job process they launch.&amp;nbsp; &lt;i&gt;Assuming&lt;/i&gt; there is only one instance of the batch system on the node, any process with that group ID must have come from the batch job.&amp;nbsp; So, when it comes time to kill batch jobs or perform accounting, we can map any process back to the batch system job.&lt;br /&gt;&lt;br /&gt;While the user process cannot get rid of the ID, it is still possible to defeat this setup (discussed below), and it has a few drawbacks.&amp;nbsp; The user process now has a new GID, and can create files using that GID; I have no clue how this might be useful, but it's a sign of misusing the GID concept.&amp;nbsp; Anything that caches the user-to-groups mapping may get the wrong set of GIDs (as having unique per-process GIDs is rare, these caches may have broken assumptions).&amp;nbsp; 
Finally, it lays extra work on the sysadmin, who now must maintain a range of unused GIDs; the range must be sufficient to provide a GID per batch slot.&amp;nbsp; Locally, we've run into the fact that the number of GIDs needed increases with the number of cores per node: what was a good setting last year is no longer sufficient.&lt;br /&gt;&lt;br /&gt;Note that, with Condor, you can take this one step further and assign a unique user ID per batch slot, and run the job under that UID as opposed to the submitter's UID.&amp;nbsp; This is a nightmare in terms of NFS-based shared file systems, but the approach at least works on both Unix and Windows.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;How to defeat your batch system (inadvertently, right?)&lt;/span&gt;&lt;br /&gt;Despite the drawbacks, the supplementary GID mechanism seems pretty foolproof: the user can no longer launch processes that can't be tracked back to a batch slot.&amp;nbsp; However, this isn't sufficient to stop malicious users.&lt;br /&gt;&lt;br /&gt;In order to kill all processes based on some attribute of the process (besides the process group), one must iterate through the contents of the /proc directory, read and parse each process's status file, and send a kill signal as appropriate.&amp;nbsp; Ultimately, all batch systems currently do some variation of this; if you want a simple source code example, go look up the sources of the venerable 'killall' utility.&lt;br /&gt;&lt;br /&gt;The approach described above does have a fatal flaw: it is not atomic.&amp;nbsp; Between looking at the contents of /proc and opening /proc/PID/status, a process could have already forked another child and exited.&amp;nbsp; Processes may have been spawned between the time when the directory iteration begins and ends, meaning they might never be seen.&lt;br /&gt;&lt;br /&gt;Hence, a process may spawn more children in the time the batch system iterates through /proc and kills it; in fact, if the batch system is 
unlucky, it may do this fast enough that the batch system never detects the process exists in the first place!&amp;nbsp; In the latter case, regardless of the tracking mechanism, the process may escape the batch system.&lt;br /&gt;&lt;br /&gt;Worse, because these short-lived processes can be invisible to the batch system, the batch system may not detect it's being fooled; if the batch system could reliably detect the attack, it might be able to send an alert or turn off the worker node.&lt;br /&gt;&lt;br /&gt;Ultimately, the batch system is defeated because it is trying to do process control from user space.&amp;nbsp; We lack three things:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Reliable process tracking that does not change the semantics of the job's runtime environment.&lt;/li&gt;&lt;li&gt;Atomic operations for determining and signaling a set of processes.&lt;/li&gt;&lt;li&gt;Detection of when (1) or (2) has failed.&lt;/li&gt;&lt;/ol&gt;Luckily, with a little help from the Linux kernel, we can overcome all three of the above issues.&amp;nbsp; Item (2) takes a fairly modern kernel (2.6.24 or later), but items (1) and (3) can be accomplished with 2.6.0 or later.&lt;br /&gt;&lt;br /&gt;As long as we have the ability to detect attacks as in (3), we can limp along until everyone gets onto a modern kernel: this is the topic of the next post.&amp;nbsp; Stay tuned.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-9119531119030852746?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/9119531119030852746/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/06/how-your-batch-system-watches-your.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8803173202887660937/posts/default/9119531119030852746'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/9119531119030852746'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/06/how-your-batch-system-watches-your.html' title='Part I: How your batch system watches your processes (and why it&apos;s so bad at it)'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8803173202887660937.post-317205796701176586</id><published>2011-06-06T17:19:00.000-07:00</published><updated>2011-06-06T17:19:43.091-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Introductions'/><category scheme='http://www.blogger.com/atom/ns#' term='Investigations'/><category scheme='http://www.blogger.com/atom/ns#' term='Blueprint'/><title type='text'>Introductions</title><content type='html'>This is the (humble) beginnings of the mostly-official blog of the OSG Technology Area.&amp;nbsp; As such, I feel it's only appropriate to start with the "who" and "what" we are.&amp;nbsp; Unfortunately, that makes the opening post of this blog rather dry... 
&lt;b&gt;it'll pick up, I swear&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;First, the "what":&amp;nbsp; The OSG Technology Area provides the OSG with a mechanism for long-term technology planning.&amp;nbsp; We do this through two sub-groups:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Blueprint&lt;/b&gt;: Recording the conceptual principles of the OSG and focusing on the long-term evolution of the OSG.&amp;nbsp; The Blueprint group tries to meet approximately quarterly and, under the direction of the OSG Technical Director, updates the "Blueprint Document" to reflect our understanding of the basic principles, definitions, and the broad outlines of how the pieces fit together.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Investigations&lt;/b&gt;: We're all surrounded by a continuing onslaught of technologies.&amp;nbsp; Some are fantastic.&amp;nbsp; Some are not-so-great.&amp;nbsp; Some are great, but not what we really needed.&amp;nbsp; In order to manage the way forward - while keeping to the OSG principles - this group does investigations to understand the concepts, functionality, and impact of external technologies.&amp;nbsp; The point is to identify items that are potentially disruptive in the medium term of 12-24 months.&lt;/li&gt;&lt;/ul&gt;Now, the "who":&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;John Hover&lt;/b&gt;: Leader of the Grid Group at Brookhaven National Lab.&amp;nbsp; John heads up the Blueprint activity.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Ashu Guru&lt;/b&gt;: Applications specialist at the Holland Computing Center (HCC), based at the University of Nebraska-Lincoln.&amp;nbsp; Ashu works for Technology Investigations; currently, he's focusing on how virtual machines may be mixed with "traditional" batch systems.&amp;nbsp; This will be the subject of another post.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Brian Bockelman&lt;/b&gt; (me!):&amp;nbsp; Yet another HCC staff member.&amp;nbsp; I've been working heavily in the Technology Area this year; I spent lots of time participating in the 
rewrite of the OSG Blueprint, and am leading the Technology Investigations group.&lt;/li&gt;&lt;/ul&gt;I'm not going to lie: I have no blogging experience.&amp;nbsp; However, I am willing to plunge in head-first; please excuse me if I fumble around with the technology a bit (don't worry - I feel much more at home with distributed high throughput computing than social media).&amp;nbsp; I've got big plans for this blog: I'm hoping to rope the other members of the Technology Area into doing guest posts, push for a vibrant OSG-related blogging community, and make this one of the better outposts for distributed high throughput computing on the internet.&lt;br /&gt;&lt;br /&gt;But before all that - I hope you enjoyed the introductions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8803173202887660937-317205796701176586?l=osgtech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://osgtech.blogspot.com/feeds/317205796701176586/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://osgtech.blogspot.com/2011/06/introductions.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/317205796701176586'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8803173202887660937/posts/default/317205796701176586'/><link rel='alternate' type='text/html' href='http://osgtech.blogspot.com/2011/06/introductions.html' title='Introductions'/><author><name>Brian Bockelman</name><uri>http://www.blogger.com/profile/03652101135146911311</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='30' 
src='http://2.bp.blogspot.com/-n-85Ok3F7cs/Te06LWQlAZI/AAAAAAAAAQg/JKJznEn0V00/s220/20772_690010681073_17211841_39591976_7783162_n.jpg'/></author><thr:total>0</thr:total></entry></feed>
