
CamGrid and PBS

It's quite common for groups to have PBS facilities for running their tightly coupled MPI jobs, and since v6.8 Condor has been able to talk directly to PBS. There are a number of caveats, the main one being that the PBS queue has to be on the same machine as the Condor schedd submitting the job to it. However, Condor also provides the Condor-C mechanism of delegated job submission, so the road is open for a schedd on machine A to request a schedd on machine B to submit a job to the PBS queue on B. Note that in what follows I describe how to submit jobs from a Condor queue to a dedicated PBS cluster; the processors running under PBS are not directly part of a Condor pool. If you're interested in having nodes that belong simultaneously to a Condor pool and a PBS cluster then you may consider the scavenging model described here.

MPI without PBS

Actually, Condor has the parallel universe for running parallel jobs directly. This can be seen as a direct competitor to PBS, and we discuss how one may configure and use it within CamGrid here.

PBS without MPI

We'll start by looking at getting PBS to work without MPI, i.e. just simple "fork" type jobs like one would run on an LCG farm. First we'll need to get Condor configured on the PBS head node. This needs the right security settings set up between the two schedds, and a simple (insecure) method for testing would be to set:

SEC_DEFAULT_NEGOTIATION             = OPTIONAL
SEC_DEFAULT_AUTHENTICATION          = OPTIONAL
SEC_DEFAULT_NEGOTIATION_METHODS     = CLAIMTOBE
SEC_DEFAULT_AUTHENTICATION_METHODS  = CLAIMTOBE

Corresponding (and also insecure) settings on the delegating (i.e. client) node would be:

SEC_DEFAULT_AUTHENTICATION          = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS  = CLAIMTOBE
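
Having set these on both machines, it's worth checking that the running daemons have actually picked the values up. A minimal sketch, assuming a shell on each machine and that the Condor daemons are already running, would be:

# Re-read the configuration files on this machine
condor_reconfig

# Confirm the authentication settings are what we expect
condor_config_val SEC_DEFAULT_AUTHENTICATION
condor_config_val SEC_DEFAULT_AUTHENTICATION_METHODS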

We'll also need to configure $CONDOR_HOME/lib/glite/etc/batch_gahp.config on the PBS head node so that it points at the PBS installation (an illustrative sketch of this file appears at the end of this section). Now suppose I want to submit an executable called "Test.sh" to a PBS head node called iguana.my.domain, which is in a Condor pool managed by the machine donkey.my.domain (or at least that's where that pool's Collector resides). Then the Condor job file that I submit will look like:

universe = grid
executable = Test.sh
output = myoutput
error = myerror
log = mylog

grid_resource = condor iguana.my.domain donkey.my.domain

+remote_universe = grid
+remote_grid_resource = pbs

queue

We launch this with the usual condor_submit command.
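
As for the batch_gahp.config mentioned above, its exact contents depend on the local PBS installation, and the key names can vary between Condor releases, so what follows is only an illustrative sketch (the paths are assumptions for iguana; check the comments in the file shipped with your own installation):

# Which LRMS the batch GAHP should drive, and where its tools and spool live
supported_lrms=pbs
pbs_binpath=/usr/bin
pbs_spoolpath=/var/spool/pbs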

PBS with MPI

We now want to run a similar job to the above, but this time I'm submitting an MPI-enabled executable. The problem is that the Condor->PBS interface does not naturally allow for this, unlike the Globus or NorduGrid interfaces; by that I mean there is no RSL-style option for specifying PBS-specific settings (e.g. number of processors). To circumvent this limitation I have to wrap the actual command I want to run in a wrapper script that invokes the appropriate MPI launcher (mpirun for MPICH, etc.). In this example I will run the parallel executable "cpi", which is built as part of the MPICH suite in the <MPICH install dir>/examples directory. I'm going to send the job to iguana.my.domain, which runs a Condor schedd and is in the pool managed by donkey.my.domain. Iguana also runs the PBS queue, whereas donkey is oblivious of all things PBS. First the submit script:

universe = grid
executable = wrapper
transfer_input_files = cpi
WhenToTransferOutput = ON_EXIT
output = myoutput
error = myerror
log = mylog

grid_resource = condor iguana.my.domain donkey.my.domain

+remote_universe = grid
+remote_grid_resource = pbs
+remote_requirements = True
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT" 

queue

Now for the wrapper script, which gets run on the PBS cluster. I'm going to ask for two processors, so:

#!/bin/sh
# Make sure the transferred cpi binary is executable
chmod +x cpi
# Launch cpi over two processors with MPICH's mpirun
/path/to/mpich/installation/bin/mpirun -np 2 cpi
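
Once the job has been submitted from the client, its progress can be followed from both sides. As a rough sketch, assuming shells on donkey (or wherever you submitted from) and on iguana:

# On the submitting machine: the usual Condor view of the job
condor_q

# On iguana, once the job has been handed to PBS: list PBS jobs, with -n
# showing the nodes/processors allocated (Torque/OpenPBS)
qstat -n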

Job completion times

The time Condor takes to report job completion can be quite long, even if the PBS job itself executes quickly. This is mainly due to Condor's polling mechanism for determining the progress of each step, which by default fires every five minutes. If this is too slow for you then consider modifying the value of the setting CONDOR_JOB_POLL_INTERVAL.
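
For example, to poll once a minute instead, something like the following could be added to the Condor configuration on the submitting (client) machine, followed by a condor_reconfig; the value is in seconds, and as ever this is a sketch to be checked against your own installation:

# Poll the remote job's status every 60 seconds rather than the default 300
CONDOR_JOB_POLL_INTERVAL = 60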