Condor uses the parallel universe to submit MPI-enabled jobs. In fact, the parallel universe just finds N machines and then launches a program on all N of them at the same time. If those programs need to communicate, it is the job of a wrapper script to set up anything that is needed. Condor includes some tools to make this easier for the wrapper script, and provides such scripts for OpenMPI and MPICH2. However, these tend to assume jobs that span multiple machines with a shared filesystem, which, as we explain below, is not suitable for CamGrid. Note that flocking does not work for parallel universe (MPI) jobs, so if you intend to use MPI facilities in a CamGrid pool other than your own then you'll have to make specific arrangements with that pool's administrator to schedule your jobs onto it.
MPI and SMP machines
Although we can configure Condor to run MPI jobs over an arbitrary number of hosts, I shall keep in mind local conditions within CamGrid and note that individual machines are, at best, connected via gigabit ethernet. Since inter-machine communication will be slow, and the lack of a shared filesystem makes such operation even more awkward, I shall only consider MPI jobs running on single SMP machines. This is not as bad as it sounds, as CamGrid now possesses a large number of multi-processor, multi-core hosts and this situation will tend to prevail in the future. Anyone who needs to use large numbers of processors for a single job should perhaps be thinking in terms of a high performance computing facility rather than a campus grid, though if you still would like to use Condor for multi-host MPI jobs then contact your CO in the first instance (sysadmins: note that the wrapper scripts that come bundled with Condor for inter-host MPI operation assume a shared file system. You may want to talk with me to discuss operating without a shared file system). The rest of this article will deal with MPI/SMP jobs on multiple cores of one physical host.
We now want to make sure that individual jobs do not fragment onto different machines. This can be achieved by giving each host its own ParallelSchedulingGroup, so in the condor_config.local of each execute machine we put:
ParallelSchedulingGroup = "$(HOSTNAME)"
DedicatedScheduler = "DedicatedScheduler@your.dedicated.submit.host"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler, ParallelSchedulingGroup
RANK = Scheduler =?= $(DedicatedScheduler)
Note that the execute hosts need to nominate a dedicated scheduler (i.e. submit host), which is a Condor idiosyncrasy for MPI operation (see the manual).
We now come to the user's submit file, and we'll demonstrate it using an example. If I'm launching a 4-process DL_POLY3 job that has executable DLPOLY.Y and uses the input files CONFIG, CONTROL and FIELD, then a suitable submit file might look like:
universe = parallel
executable = mpi-script
arguments = DLPOLY.Y
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = DLPOLY.Y, CONFIG, CONTROL, FIELD
+WantParallelSchedulingGroups = True
notification = never
log = log
error = err
output = out
queue
The wrapper mpi-script is pretty simple for our single-host SMP operation, and an example of work that was carried out using this method has been published in the following Science article. Below we give examples for MPICH2 and OpenMPI.
Here's the wrapper for MPICH2, which we call mp2script.smp. The machine file (machfile) mentioned in mp2script.smp has a particularly simple form for our SMP machines: it contains the single entry hostname:NumberOfVMs, e.g. "valletta--escience:4".
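A sketch of what such a single-host wrapper might contain is given below. The install path is an assumption you will need to adjust, and the script assumes MPICH2's mpd process manager; the _CONDOR_PROCNO and _CONDOR_NPROCS environment variables are set by Condor for parallel universe jobs.

```shell
#!/bin/sh
# mp2script.smp -- sketch of a single-host MPICH2 wrapper.
# Condor starts one copy of this script per slot; only slot 0 runs MPI.

EXECUTABLE=$1
shift

# All slots other than the first exit immediately.
if [ "$_CONDOR_PROCNO" != "0" ]; then
    exit 0
fi

# Assumption: adjust this to wherever MPICH2 lives on your execute hosts.
export PATH=/usr/local/mpich2/bin:$PATH

# Build the one-line machine file described above: hostname:NumberOfVMs
echo "`hostname`:$_CONDOR_NPROCS" > machfile

# Boot a single mpd daemon, run the job, then tear the daemon down.
mpdboot -n 1 -f machfile
mpiexec -n $_CONDOR_NPROCS ./$EXECUTABLE "$@"
status=$?
mpdallexit
exit $status
```

Since everything runs on one physical host, mpdboot only ever starts one daemon, and the machine file never needs more than its single line.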
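For OpenMPI the single-host case is even simpler, since no daemons need to be booted: mpirun can fork all the processes locally. A sketch, again with the install path as an assumption:

```shell
#!/bin/sh
# Sketch of a single-host OpenMPI wrapper.

EXECUTABLE=$1
shift

# All slots other than the first exit immediately.
if [ "$_CONDOR_PROCNO" != "0" ]; then
    exit 0
fi

# Assumption: adjust to your OpenMPI installation.
MPDIR=/usr/local/openmpi
export PATH=$MPDIR/bin:$PATH
export LD_LIBRARY_PATH=$MPDIR/lib:$LD_LIBRARY_PATH

# All processes run on this host, so no hostfile is required.
mpirun -np $_CONDOR_NPROCS ./$EXECUTABLE "$@"
```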
As mentioned in the opening paragraph, in the parallel universe Condor simply starts N identical jobs on N separate slots; we leverage this to run MPI jobs by exiting immediately from N-1 of those slots and launching the MPI head process on the remaining one. The problem with this is that any file transfer occurs before our wrapper script is invoked (naturally, since the script itself has to be transferred too), so if we're not using a shared file system then all N slots receive copies of the input files and the executable (if requested). This can waste considerable network bandwidth, as well as disk space, for slots residing on the same physical machine. Ideally Condor would transfer only one copy of these files per machine, but currently such support does not exist. Hence, your choices are:

1) live with this "feature";

2) run your own user-land shared file system, e.g. via Parrot;

3) arrange for all files to be pre-staged on the execute nodes;

4) don't request any files to be transferred; instead put them on an ftp/http/whatever server, have one slot per host pull them over when the wrapper script starts up (e.g. using wget or curl), and symlink from the other slots on that host before the MPI command is executed;

5) if you can just use the cores of one physical host, using SMP/MPI, then you can avoid the parallel universe altogether. Instead, use the vanilla universe in conjunction with dynamic slots, requesting as many cores for that slot as MPI processes.
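The last of these options, using the vanilla universe with dynamic slots, might look like the following submit file. This is a sketch, not a tested CamGrid configuration: request_cpus requires that the execute hosts are configured with partitionable slots.

```
universe = vanilla
executable = mpi-script
arguments = DLPOLY.Y
request_cpus = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = DLPOLY.Y, CONFIG, CONTROL, FIELD
log = log
error = err
output = out
queue
```

Since only one slot is ever claimed, the wrapper no longer needs the exit-from-all-but-one-slot logic, and the input files are transferred exactly once.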
Setting up MPI under Condor can be a bit tricky, but take heart from the fact that there are quite a few people using it successfully on CamGrid, mostly in non-SMP (i.e. ssh internode) operation. Members of CamGrid are welcome to contact me for technical support. Non-members of CamGrid are requested to direct their queries for help to the condor-users mailing list, as unfortunately I do not have the spare time to field such requests.