Notes for new users
Users new to CamGrid should consider the following steps:
- Read the User's chapter from the HTCondor manual. This will tell you how to submit and manage your jobs within HTCondor. Also check out these simple submit file examples.
- The Computing Service runs introductory courses for using HTCondor and CamGrid. These are a great way for you to gain hands-on experience, and details of upcoming ones can be found here.
- Get yourself added to the ucam-camgrid-users mailing list (go here). This list carries important information about service outages and new tools/tutorials. Traffic is low, at only 2-3 messages a month, and new users are expected to join.
- CamGrid's made up of Linux machines running various distributions. This means that jobs compiled under one Linux distribution may end up running under a different one, which can result in certain library incompatibilities. For this reason it may be preferable to submit statically linked applications if dynamically linked ones encounter problems.
- If your jobs are going to run for a long time (many days), and so run the risk of losing all their accumulated work in the case of a disruption, then you may want to run in the standard universe. This can make your life easier: jobs that get kicked off execute nodes, either due to the HTCondor policy those nodes run or (more likely) due to network glitches between departments, will periodically checkpoint (save state) and continue on another machine from where they left off. However, your code must be compiled (not scripted or interpreted) and must be able to link with HTCondor's relevant libraries. You can check here to see whether your application is compatible with the standard universe. Generally, I recommend that CamGrid users stick with the Vanilla Universe.
- CamGrid does not have a single filesystem that spans all of its machines, so when using the Vanilla Universe make sure that you ask HTCondor to use its own file transfer mechanism by including in your submit file:
transfer_input_files = < comma separated list of your input files >
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
- If you run in the vanilla universe then try to mitigate any potential problems from jobs being restarted. For example, there's really not much point in having flocked jobs that run for more than several hours (no more than 24), since it is quite possible for them to lose network connectivity and restart from scratch. A job that should take a few hours can then take days or even weeks to complete (if at all), which wastes CamGrid's resources. Try to ensure that your application can save its state and restart where it left off. You can then get your jobs to perform application level checkpointing by wrapping them in a recursive shell script, or maybe use Parrot to write output directly back to the submit host for re-use in future runs. If your local CamGrid sysadmin is willing to get involved, then you can even try checkpointing your vanilla jobs directly.
- By default execute hosts will run a job to completion. However, where this is not the case, e.g. a machine may be prepared to preempt a running job for one with higher priority, then that machine will advertise the CamGrid-specific classad value VANILLA_DANGER as TRUE. If you don't want your vanilla universe jobs to be preempted (this is not an issue for checkpointable standard universe jobs) then check for it in your submit file's "Requirements" line by adding the requirement "VANILLA_DANGER =!= TRUE".
- In a non-Standard universe, HTCondor will automatically bring back any output files created in the scratch directory since the job started, but not sub-directories. Hence, if your application produces sub-directories make sure that you turn them into files (e.g. by tar-ing and gzip-ing them) before the job exits. HTCondor will then return this compressed file to the submit host.
- Very short jobs (those lasting a few minutes or less) have their own problem. HTCondor can take a few minutes itself (especially if flocking) to match a resource with a job, so it's quite wasteful for a job to only last a relatively short duration executing once it has started. If you do have applications with this behaviour then try to bundle a number of them into a single HTCondor job so that the latter takes about an hour to run.
- Think about the amount of output that your application produces. When your jobs finish, the remote execute hosts will all try to write their output data back to the submit host: are you sure you have the space available? Also, don't abuse the execute nodes, and avoid producing more than 10 GB of scratch data per job.
- Attempting to launch too many jobs to run simultaneously on the grid can stress your submit host and cause none to run. This is usually related to the submit host attempting to transfer all the required input files and being timed out (especially true if your input files are being served off a non-local disk, e.g. an NFS mount), causing the jobs to be rescheduled in a vicious cycle. If you think that this is happening to you, then launch your jobs in batches or let HTCondor's workflow manager, DAGMan, throttle the number of running jobs. To perform the latter, put all jobs into a .dag file (call it "everything.dag"), without any parent relationships, and then submit (max 25 concurrent jobs allowed in this example):
condor_submit_dag -maxjobs 25 everything.dag
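For many jobs, writing the .dag file by hand is tedious. The following is a minimal sketch of how it could be generated, assuming one submit file per job. The node names ("job0", "job1", ...) and the function name are illustrative choices, not anything CamGrid prescribes. Each job becomes a DAGMan JOB line, and no PARENT/CHILD lines are written, so DAGMan is free to run the jobs in any order.

```python
def make_flat_dag(submit_files):
    """Return the text of a DAG file with one independent node per submit file.

    Only "JOB <node name> <submit file>" lines are emitted; with no
    PARENT/CHILD lines the nodes have no ordering between them.
    """
    lines = []
    for index, submit_file in enumerate(submit_files):
        # Node names must be unique within the DAG; "jobN" is an arbitrary scheme.
        lines.append(f"JOB job{index} {submit_file}")
    return "\n".join(lines) + "\n"
```

Writing the returned text to "everything.dag" gives a file that can be passed to the condor_submit_dag command shown above.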
- By default, HTCondor is set up to run serial (i.e. single threaded) jobs. Hence, it will allocate one core per job. If your task is multi-threaded, or multi-process (e.g. uses MPI), then HTCondor can support this but it will need special handling. If your application will use multiple processes/threads on a single host, then request the required number of cores by adding to your submit file:
request_cpus = <number of required cores>
If your application needs to span multiple hosts using MPI then in the first instance contact your local sysadmin for advice.
- It is your responsibility to be aware of your job's characteristics, e.g. its memory requirements. By default HTCondor will allocate an amount of memory to your job (decided by the execute host's properties and maybe even configuration), but if your job then grows beyond what has been allocated there's a chance of it being killed, either by HTCondor's configuration or even by the operating system if the machine starts to run out of memory. Hence, determine how much memory your job is most likely to need and add to your submit file:
request_memory = <memory required, in MB>
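The submit-file settings mentioned so far can be collected in one place. The sketch below renders a vanilla universe submit description from a few parameters. The function name, its parameters and the default values (one core, 1024 MB) are placeholders chosen for illustration; real jobs will usually also want output, error and log entries, which are left out here.

```python
def render_submit_file(executable, input_files, cpus=1, memory_mb=1024,
                       avoid_preemption=True):
    """Return submit-file text combining the settings discussed above."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
    ]
    if input_files:
        # HTCondor's own file transfer takes a comma separated list.
        lines.append(f"transfer_input_files = {', '.join(input_files)}")
    lines += [
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT_OR_EVICT",
        f"request_cpus = {cpus}",
        f"request_memory = {memory_mb}",  # in MB
    ]
    if avoid_preemption:
        # Skip machines that advertise the CamGrid-specific VANILLA_DANGER as TRUE.
        lines.append("requirements = (VANILLA_DANGER =!= TRUE)")
    lines.append("queue")
    return "\n".join(lines) + "\n"
```

For example, render_submit_file("my_app", ["a.dat", "b.dat"], cpus=4, memory_mb=2048) yields text that can be saved to a file and handed to condor_submit; "my_app" and the data file names are of course hypothetical.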
- HTCondor provides its own scratch space to work in. However, some applications require one with a well known path, e.g. /tmp. If your application is one of these, then please check the CamGrid classads TMPDIR and CANNOT_WRITE_TO_TMPDIR. The former is a string containing the path to the scratch space available on that execute node (defaults to "/tmp"), and the latter is a boolean telling you whether you're allowed to make use of TMPDIR (defaults to FALSE, so by default you're allowed).
- Why not take a look at other user guides written for CamGrid? Here's one from Biological Sciences and another from Astrophysics.
- CamGrid's free at point of use. All we ask is that you acknowledge its use in any publication arising from work carried out on it, and then notify me of the article details at the email below.
- If you run into difficulties then your local sysadmin should be your first port of call. See the list of "Contacts" if you're unsure who your sysadmin is. You should also consider using the ucam-camgrid-users mailing list mentioned above and the more general htcondor-users mailing list, to which you can subscribe here (note that the latter produces about 5-10 emails a day).
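Two pieces of advice from the list above, saving state so that a restarted vanilla job can pick up where it left off and packing output sub-directories into a single file, can be illustrated with a small job wrapper. This is only a sketch under stated assumptions: the checkpoint file name, the step-based work loop and the helper names are all hypothetical, and it assumes the checkpoint file is present in the job's working directory again when the job restarts (for instance because output is transferred on eviction, as configured with when_to_transfer_output above).

```python
import json
import tarfile
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # hypothetical name, kept in the scratch directory


def load_state():
    """Resume from the checkpoint if one exists, otherwise start fresh."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"next_step": 0, "total": 0}


def save_state(state):
    """Write to a temporary file, then rename, so a kill mid-write
    cannot leave a truncated checkpoint behind."""
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(CHECKPOINT)


def run(n_steps, checkpoint_every=10):
    """Do n_steps units of work, skipping those already recorded as done."""
    state = load_state()
    for step in range(state["next_step"], n_steps):
        state["total"] += step  # stand-in for the application's real work
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_state(state)
    save_state(state)
    return state["total"]


def pack_directory(directory, archive_name):
    """Turn an output sub-directory into one gzip-compressed tar file."""
    with tarfile.open(archive_name, "w:gz") as archive:
        archive.add(directory, arcname=Path(directory).name)
    return archive_name
```

How often to checkpoint is a trade-off between the cost of writing state and the amount of work lost on a restart; the value of 10 steps here is arbitrary.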