Thursday, December 29, 2011

What's the hold-up?

Do you have the following diagram memorized?

If your site runs Condor, you probably should.  It shows the states of the condor_startd, the activities within each state, and the transitions between them.  If you want jobs reliably preempted (or is that killed?  Or vacated?) from the worker node for something like excessive memory usage, a clear understanding of this diagram is required.

However, the 30 state transitions might be a bit much for site admins who just want to kill jobs that go over a memory limit.  In that case, admins can use the SYSTEM_PERIODIC_REMOVE or SYSTEM_PERIODIC_HOLD configuration parameters on the condor_schedd to remove or hold jobs, respectively.

The schedd periodically evaluates these expressions against its copy of each job's ClassAd (by default, once every 60 seconds); if an expression evaluates to true for a given job, the job is removed or held.  For a running job, this almost immediately preempts execution on the worker node.
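As an aside, the evaluation interval itself is tunable; a minimal sketch, assuming the stock PERIODIC_EXPR_INTERVAL knob (which defaults to 60 seconds):

# Evaluate the periodic expressions every two minutes instead of every 60s.
PERIODIC_EXPR_INTERVAL = 120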

[Note: While effective and simple, these are not the best way to accomplish this sort of policy!  As a worker node may talk to multiple schedds (via flocking, or just through a complex pool with many schedds), it's best to express the node's preferences locally in the startd configuration.]
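(For illustration only, a startd-local memory policy might look something like the sketch below, using the standard PREEMPT and WANT_VACATE knobs; this is not our production configuration, and it assumes Memory is reported in MB and ResidentSetSize in KB.)

# Illustrative startd policy sketch: preempt any job whose resident set
# exceeds the slot's provisioned memory (Memory is in MB, ResidentSetSize in KB).
PREEMPT = (ResidentSetSize =!= UNDEFINED) && (ResidentSetSize > Memory * 1024)
# Skip the graceful vacate and go straight to killing once PREEMPT fires.
WANT_VACATE = False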

At HCC, the periodic hold and remove policy looks like this:

# hold jobs using absurd amounts of disk (100+ GB) or memory (1.6+ GB)
SYSTEM_PERIODIC_HOLD = \
   (JobStatus == 1 || JobStatus == 2) && ((DiskUsage > 100000000 || ResidentSetSize > 1600000))

# forcefully remove running jobs after 2 days, held jobs after 6 hours,
# anything that has tried to run more than 10 times, and jobs held
# because their input files are missing
SYSTEM_PERIODIC_REMOVE = \
   (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*6) || \
   (JobStatus == 2 && CurrentTime - EnteredCurrentStatus > 3600*24*2) || \
   (JobStatus == 5 && JobRunCount >= 10) || \
   (JobStatus == 5 && HoldReasonCode =?= 14 && HoldReasonSubCode =?= 2)

We place anything on hold that goes over some pre-defined resource limit (disk usage or memory usage).  Jobs are removed if they have been on hold for a long time, have run for too long, have restarted too many times, or are missing their input files.

Note that this is a flat policy for the cluster - heterogeneous nodes with large amounts of RAM per core would not be well-utilized.  We could tweak this by having users add a RequestMemory attribute to their job's ad (defaulting to 1.6GB), putting a clause in the job's Requirements that the slot have sufficient memory, and having the node only accept jobs that request memory below a certain threshold.  The hold expression above could then be tweaked to hold jobs whose resident set size exceeds their request, as sketched below.  Perhaps more on that in the future if we go this route.
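(A per-job version of the memory hold might look like the following sketch.  This is not something we run today; note that RequestMemory is expressed in MB while ResidentSetSize is in KB, hence the conversion factor.)

# Sketch: hold jobs whose resident set exceeds their memory request
# (RequestMemory is in MB, ResidentSetSize in KB; default the request to 1600 MB).
SYSTEM_PERIODIC_HOLD = \
   (JobStatus == 1 || JobStatus == 2) && \
   (ResidentSetSize > 1024 * ifThenElse(isUndefined(RequestMemory), 1600, RequestMemory))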

While the SYSTEM_PERIODIC_* expressions are useful, Dan Bradley recently introduced me to the SYSTEM_PERIODIC_*_REASON parameters.  These allow you to build a custom hold message for the user whose jobs you're about to interrupt.  The expression is evaluated within the context of the job's ad, and the resulting string is placed in the job's HoldReason attribute.  As an example, previously the hold message was something bland and generic:

The SYSTEM_PERIODIC_HOLD expression evaluated to true.

Why did it evaluate to true?  Was it memory or disk usage?  When the job was held, how bad was the disk/memory usage?  These things can get lost in the system.  Oops.  We added the following to our schedd's configuration:

# Report why the job went on hold.
SYSTEM_PERIODIC_HOLD_REASON = \
   strcat("Job in status ", JobStatus, \
   " put on hold by SYSTEM_PERIODIC_HOLD due to ", \
   ifThenElse(isUndefined(DiskUsage) || DiskUsage < 100000000, \
      strcat("memory usage ", ResidentSetSize), \
      strcat("disk usage ", DiskUsage)), ".")

Now, we have beautiful error messages in the user's logs explaining the issue:

Job in status 2 put on hold by SYSTEM_PERIODIC_HOLD due to memory usage 1620340.
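The wildcard above suggests there's a matching knob on the remove side as well; assuming a SYSTEM_PERIODIC_REMOVE_REASON parameter, a sketch (which we have not deployed) mirroring the removal policy above might look like:

# Sketch: record why SYSTEM_PERIODIC_REMOVE fired.
SYSTEM_PERIODIC_REMOVE_REASON = \
   strcat("Job in status ", JobStatus, \
   " removed by SYSTEM_PERIODIC_REMOVE (held too long, ran too long, ", \
   "restarted too many times, or missing input files).")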

One less thing to get confused about!
