Resource Usage and Cgroups

For quite some time it has been possible to limit the resource usage of HTCondor jobs by using Linux's ulimit mechanism. This is nice and simple, but not very effective. The flaw is that resource limits imposed this way are per process, not per job, and since jobs are often composed of many Unix processes, these can slip under the radar and collectively use more resources than the job was entitled to. Moreover, the memory limit applies only to the virtual memory size, not to the physical memory size (the resident set size). This is a problem for jobs that use the mmap system call to map in a large chunk of virtual memory but only touch a small amount of it at any one time. Typically, the resource the administrator would like to control is physical memory, because when that is in short supply the machine starts paging and can become unresponsive very quickly.
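
To see the per-process flaw concretely, here is a hypothetical sketch (the worker program is made up for illustration): each forked process gets its own limit, so the job as a whole can blow straight past the intended total.

ulimit -v 1048576   # cap each process at 1 GiB of *virtual* memory (value is in KiB)
./worker &          # this process may use up to 1 GiB...
./worker &          # ...and so may this one: 2 GiB in total for one "job"
wait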

Fortunately, since Linux kernel version 2.6.24 there has been a Control Groups (cgroups) feature, which provides a more effective way of limiting resource usage because it applies to whole process groups. Note that even if cgroup support is built into the kernel, many distributions do not install the necessary cgroup tools by default. On RPM-based systems, these can be installed with the command:

yum install libcgroup\*

whereas on Debian-based systems you can do:

apt-get install cgroup-bin
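
Whichever distribution you're on, it's worth checking which cgroup controllers the running kernel knows about, and whether they are mounted, before going any further:

cat /proc/cgroups       # lists each subsystem and whether it is enabled (last column)
mount | grep cgroup     # shows where the subsystems are currently mounted, if anywhere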

Note that some Linux distributions activate cgroups automatically (e.g. Ubuntu) whereas others don't (e.g. Debian). Matters are further muddied because some distributions put the cgroup subsystems under separate directories (e.g. Ubuntu; see the alternative mount section after the Debian example below), whereas others put them all in the same directory (e.g. Debian, in /sys/fs/cgroup). To activate the cgroups as required by HTCondor, we first create the file /etc/cgconfig.conf, whose contents depend on how the distribution organises the cgroup hierarchy, as just discussed. For example, for Debian we'd have:

mount {
        cpu	= /sys/fs/cgroup;
        cpuset	= /sys/fs/cgroup;
        cpuacct = /sys/fs/cgroup;
        memory  = /sys/fs/cgroup;
        freezer = /sys/fs/cgroup;
        blkio   = /sys/fs/cgroup;
}

group htcondor {
  cpu {}
  cpuacct {}
  memory {}
  freezer {}
  blkio {}

  cpuset {
    cpuset.cpus = 0-7; # For eight cores. Adjust accordingly
    cpuset.mems = 0;
  }
}
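
For comparison, on a distribution that mounts each subsystem in its own directory, such as Ubuntu, the mount section would instead point each controller at its own path (the group section stays the same). Something along these lines, though the exact paths may vary between releases:

mount {
        cpu     = /sys/fs/cgroup/cpu;
        cpuset  = /sys/fs/cgroup/cpuset;
        cpuacct = /sys/fs/cgroup/cpuacct;
        memory  = /sys/fs/cgroup/memory;
        freezer = /sys/fs/cgroup/freezer;
        blkio   = /sys/fs/cgroup/blkio;
}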

A further Debian strangeness is that although the memory subsystem is built into the kernel, it is not activated automatically. It needs to be enabled via a kernel boot option (followed by a reboot), e.g. by setting the following in /etc/default/grub (and don't forget to run update-grub afterwards):

GRUB_CMDLINE_LINUX="cgroup_enable=memory"
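
After rebooting, you can confirm the option took effect: the kernel command line should include it, and the memory controller should now show as enabled (a 1 in the last column) in /proc/cgroups:

cat /proc/cmdline
grep memory /proc/cgroups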

We're nearly there, but not quite. See that "cpuset.mems = 0" setting above? We need it, because otherwise that field will be empty, causing HTCondor's attempts to attach processes to the relevant cgroup to fail, which shows up in the ProcLog as the message:

Cannot attach pid xxxx to cgroup ... No space left on device

This is because HTCondor creates a hierarchy of subdirectories, one for each job, under /path/to/cgroup/htcondor, and the cpuset.mems field is not inherited by default, which leaves it empty in each child. To counter this we turn on inheritance (clone_children) just after creating the top cgroup hierarchy, which can be achieved by putting the following in /etc/rc.local:

/usr/sbin/cgconfigparser -l /etc/cgconfig.conf
/bin/echo 1 > /sys/fs/cgroup/htcondor/cgroup.clone_children
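
A quick sanity check, using a throwaway child cgroup: with clone_children set, a newly created subdirectory should inherit the cpuset values from its parent rather than being left empty:

mkdir /sys/fs/cgroup/htcondor/test
cat /sys/fs/cgroup/htcondor/test/cpuset.mems   # should print 0, not an empty line
rmdir /sys/fs/cgroup/htcondor/test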

Having got our cgroups configured, we now need to tell HTCondor to use them (this requires the condor_procd daemon to be running, so it won't work if you've disabled that feature). We achieve this with two entries in the relevant HTCondor configuration file: one telling HTCondor which base cgroup to use, and the other telling it what sort of resource-limiting policy to apply:

BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft   # Can also be "none" or "hard". The default is "soft"
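
Once a job is running you can verify that it has landed in the right place: its processes should list the htcondor hierarchy in /proc/<pid>/cgroup, and a per-job subdirectory should appear under the base cgroup (the exact per-job directory name varies with the HTCondor version and slot, so the listing below is only indicative):

cat /proc/<job_pid>/cgroup          # each controller line should mention .../htcondor/...
ls /sys/fs/cgroup/htcondor/         # one subdirectory per running job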