It can be pretty frustrating for a job that's been running for the last few days to suddenly die, either because it has been evicted or because the execute host has gone belly up, and for the results of that run to go the way of the dodo. One way to ameliorate this problem in Condor is to use the standard universe, which periodically checkpoints an application by taking snapshots of the memory image of the running job. However, this requires us to be able to link our code with Condor's libraries, which in many cases is not possible, either because we've only got an executable or because the language or compiler that the application uses falls outside the limited set that the standard universe supports. If it's simplicity we want then we can use HTCondor's facility for bringing back files when jobs are evicted or held, though that puts the onus on the user to modify the application to be able to use such a facility. Another possibility is to use the BLCR kernel modules to checkpoint vanilla jobs directly, but this only works with dynamically linked applications on Linux execute hosts and requires a degree of sysadmin intervention to install the modules.
Fortunately, in many cases we may get around this by checkpointing at the application level by using a workflow with the following logic:
This isn't rocket science, and people have been using this technique with batch queueing systems for years, but in a grid environment the implementation can be a bit tricky. First we need to decide what the real completion criterion for our job is, and how we can check for it. Then we submit short runs of our job, check at the end of each run whether we've met the completion criterion, and if not, resubmit it, first making sure that we restart where the last run left off. This may sound a bit vague, so we'll demonstrate it with a concrete example from the field of molecular dynamics. The other thing to keep in mind is that although some of the ideas here may sound complicated, your local CamGrid computer officer may be happy to help you port your application to make use of this paradigm. Otherwise, contact me.
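Boiled down to a shell sketch, the loop looks like this. Note that the submit-and-wait step is stubbed out with a counter file so the sketch stands alone; `run_one_step`, `TARGET` and the `checkpoint` file are inventions of mine for illustration, not part of any Condor tool:

```shell
#!/bin/bash
# Skeleton of the resubmit-until-done loop. run_one_step stands in for
# "condor_submit <job file>; condor_wait <log file>"; here it just
# advances a counter file, playing the part of the checkpoint left
# behind by a short run.
TARGET=10          # total number of "simulation steps" we want
STEPS_PER_RUN=4    # how far one short run gets before being stopped

rm -f checkpoint   # start from scratch for this demonstration

run_one_step() {
    local cur
    cur=$(cat checkpoint 2>/dev/null || echo 0)
    cur=$(( cur + STEPS_PER_RUN ))
    if [ "$cur" -gt "$TARGET" ]; then cur=$TARGET; fi
    echo "$cur" > checkpoint
}

steps_done=0
while [ "$steps_done" -lt "$TARGET" ]; do
    run_one_step                  # submit a short run and wait for it
    steps_done=$(cat checkpoint)  # read how far that run got
    echo "completed $steps_done of $TARGET steps"
done
echo "simulation finished"
```

In the real workflow the body of `run_one_step` is a `condor_submit` followed by a `condor_wait`, and the "checkpoint" is whatever restart files your application writes.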
DL_POLY is an empirical-potential molecular dynamics code that allows the user to specify the number of simulation time steps that a run should perform, as well as the maximum length of wall time that a job should be allowed to run for. If a run reaches its maximum allowed run time before it has performed all the simulation time steps we've requested then the job will simply be terminated, and its state up to that point will be saved in the output files REVIVE and REVCON. One can then start a new simulation that continues from where the last one left off by renaming REVIVE to REVOLD and REVCON to CONFIG. These are then used as input files by the new run, as long as we've also included a line with the single word "restart" in the file CONTROL. We could choose to embody this workflow using Condor's own workflow manager, DAGMan, but for simplicity we'll roll our own using a shell script. The CONTROL input file is created such that the total number of simulation steps requested will be too many for the amount of run time that a job will be allowed. Hence, we'll adopt the strategy shown in the diagram above. We'll submit a job via the usual command (though not directly, as will be explained shortly):
condor_submit <job file>
and wait for it to finish via the command
condor_wait <log file>
condor_wait scans the log file passed to it and blocks until it detects that the job has finished. Once condor_wait returns, that's our cue to check the file OUTPUT for the line that says:
run terminated after XXXX steps
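Extracting that number, comparing it against what we asked for, and doing the file shuffling for the next run takes only a few lines of shell. Here is a sketch in which the OUTPUT, REVIVE, REVCON and CONTROL files are stand-ins with placeholder contents (a real run's files come from DL_POLY itself, and the requested step count would really be read from CONTROL):

```shell
# Stand-ins for the files a truncated run leaves behind; in a real
# run these are produced by DL_POLY itself.
echo "run terminated after 3200 steps" > OUTPUT
echo "accumulator data"  > REVIVE
echo "final coordinates" > REVCON
printf 'steps 5000\n'    > CONTROL

REQUESTED=5000   # the total step count we asked for

# Pull the step count out of the termination line in OUTPUT.
steps=$(awk '/run terminated after/ { print $4 }' OUTPUT)

if [ "$steps" -lt "$REQUESTED" ]; then
    # Not done yet: last run's saved state becomes next run's input.
    mv REVIVE REVOLD
    mv REVCON CONFIG
    # Ensure CONTROL asks for a restart (prepend the keyword if absent).
    grep -q '^restart$' CONTROL || sed -i '1i restart' CONTROL
    echo "only $steps of $REQUESTED steps done; resubmitting"
else
    echo "simulation complete"
fi
```

The `sed -i '1i restart'` form is GNU sed; on other platforms you may need the multi-line `i\` syntax instead.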
If the number XXXX is less than what we requested then we know that we need to turn the output files into input files and resubmit the job. Hence, we need to run condor_submit followed by condor_wait as many times as it takes for the individual jobs to complete the entire simulation. The natural way to express this is via a loop in a shell script, and we've embodied the necessary logic for this algorithm in the Bash script recurse.sh. This script takes one argument, namely the Condor job file (which it will have to modify for subsequent steps after the first step has completed). You can run this example with the following command:
% recurse.sh job
Note that in this case we do not run the job by calling condor_submit directly from the command line; instead the script does so from within a loop, as many times as it takes for the entire simulation to complete. Unlike condor_submit, recurse.sh does not return control to your prompt immediately, but instead runs in the foreground. It will print a commentary as it goes, and will need to do 3 or 4 runs for the job to finally complete (whence you'll get your prompt back). You'll then find a lot of new files in your directory, including REVCON, STATIS and OUTPUT, which are usually the files you want.
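Since the script blocks in the foreground for as long as the whole simulation takes, for a long campaign it's worth detaching it from your terminal with the standard nohup-and-background idiom so it survives you logging out (assuming recurse.sh and your job file sit in the current directory):

```shell
# Run the whole submit/wait/resubmit loop detached from the terminal,
# with its running commentary captured in recurse.out.
nohup ./recurse.sh job > recurse.out 2>&1 &
```

You can then follow progress at any time with `tail -f recurse.out`.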