These notes describe some issues that CamGrid pool administrators need to be aware of when upgrading their Condor pools from the 8.0 old-stable branch of HTCondor to the 8.2 stable branch. I'll start with a summary of the test results before listing the tests themselves. I'll then mention some of the more relevant new features that this new stable branch introduces.
This branch represents another round of minor improvements and new functionality, with most changes seemingly being internal speed-ups. There's noticeably increased support for using GPUs with HTCondor, but I doubt this is of relevance on CamGrid. Partitionable slots appear to work better under the Parallel Universe than before, but if you want to run multi-machine parallel jobs then you should stick to static slots.
The table below lists the tests that were run, together with their outcome. I considered them to be broadly representative of the functionality that's of most importance on CamGrid. These tests were carried out using Condor 8.2.7, and all MPI tests used OpenMPI 1.4.5. All machines were running Debian 7.8.
| Test | Result | Comments |
|---|---|---|
| 1) Simple Vanilla Universe job | Pass | |
| 2) Vanilla Universe job with file transfer | Pass | |
| 3) Standard Universe job | Pass | SU support added for Ubuntu 14.04, but this needs a specific HTCondor build. |
| 4) Multiple long jobs to test dynamic slots | Pass | |
| 5) SMP/MPI via Parallel Universe - static slots | Pass | |
| 6) SMP/MPI via Parallel Universe - dynamic slots | Pass | Dynamic slot stays claimed unless "CLAIM_WORKLIFE = 0" is set on the execute host. |
| 7) SMP/MPI via Vanilla Universe | Pass | |
| 8) Multi-host MPI job via Parallel Universe - static slots | Pass | Need to modify the last line of $CONDOR_HOME/libexec/condor_ssh. |
| 9) Multi-host MPI job via Parallel Universe - dynamic slots | Pass | Need to get the combination of request_cpus and machine_count right. Very fiddly, especially if the core count differs between machines. Best to stick to static slots for multi-host jobs. |
| 10) Multi-host MPI job via Parallel Universe - mixed slots | Pass | See previous comment. |
| 11) Use of Vanilla Universe file viewer | Pass | Needs the JOB_EXECDIR_PERMISSIONS configuration variable to be set to WORLD. |
| 12) Test condor_ssh_to_job | Pass | |
| 13) Test DAGMan | Pass | |
| 14) Flocking jobs | Pass | |
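The two configuration changes noted against tests 6 and 11 could be applied on an execute host along the following lines. This is only a sketch: the variable names and values come from the notes above, but where you put them (a local configuration file, say) is an assumption about your own setup.

```conf
# Local configuration on the execute host (file location is site-specific).

# Test 6: number of seconds for which a claim may keep accepting new jobs.
# Setting it to 0 stops a dynamic slot from staying claimed after the
# parallel job has finished.
CLAIM_WORKLIFE = 0

# Test 11: make job execute directories world-accessible so that the
# Vanilla Universe file viewer can read them.
JOB_EXECDIR_PERMISSIONS = WORLD
```

Both settings affect every job that lands on the host, not just the ones tested here, so it may be worth checking that the wider consequences (more frequent renegotiation, readable job directories) are acceptable on your pool before adopting them.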
Some of the more relevant new features in the 8.2 series:
- Better Ganglia support.
- The new configuration variable FILE_TRANSFER_DISK_LOAD_THROTTLE enables dynamic adjustment of the level of file transfer concurrency in order to keep the disk load generated by transfers below a specified level.
- The new condor_gpu_discovery tool detects CUDA and OpenCL GPUs, reporting them in the format needed to configure GPU resources using the configuration variable MACHINE_RESOURCE_INVENTORY_GPUs.
- HTCondor can now discover, schedule, and manage GPUs in an "exceedingly simple way" by inserting:
  use feature : GPUs
  in the configuration file.
- Two new pre-defined configuration variables can be referenced as $(DETECTED_PHYSICAL_CPUS) and $(DETECTED_CPUS). $(DETECTED_PHYSICAL_CPUS) contains the number of physical (non-hyperthreaded) CPUs. $(DETECTED_CPUS) will match the value of either DETECTED_CORES or DETECTED_PHYSICAL_CPUS, depending on the state of COUNT_HYPERTHREAD_CPUS. NUM_CPUS now defaults to the value of DETECTED_CPUS.
- The new configuration variable NETWORK_HOSTNAME sets the host name that HTCondor uses to identify the local machine. If NETWORK_HOSTNAME is not set, then HTCondor uses the gethostname() function to determine the machine's host name. This variable is useful if a machine has multiple network interfaces with different host names.
- The default mailer has been switched to sendmail, because HTCondor's interactions with mailx could lead to privilege escalation.
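As a rough illustration of how a few of the variables listed above might appear together in a configuration file, here is a minimal sketch. The numeric throttle value and the host name are made-up placeholders, not recommendations, and none of these settings were part of the tests reported above.

```conf
# Illustrative values only; adjust or omit to suit your own pool.

# Try to keep the disk load generated by file transfers below about 2.0.
FILE_TRANSFER_DISK_LOAD_THROTTLE = 2.0

# Ignore hyperthreads when counting CPUs, so that DETECTED_CPUS follows
# DETECTED_PHYSICAL_CPUS rather than DETECTED_CORES.
COUNT_HYPERTHREAD_CPUS = False

# Redundant with the new default; shown only to make the relationship explicit.
NUM_CPUS = $(DETECTED_CPUS)

# On a machine with several network interfaces, fix the host name that
# HTCondor uses for itself (hypothetical name).
NETWORK_HOSTNAME = node01.grid.example.org
```

Running condor_config_val NUM_CPUS afterwards is a quick way to see what the configuration resolves to on a given machine.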