Last updated 21/10/2010
If you've ever run a computational job of any serious duration you'll appreciate the value of being able to periodically checkpoint your application, i.e. save its state, so that in the event of a mishap within the computing environment (e.g. an execute-node crash, or a network outage in a distributed system) your job can restart from the last state it saved rather than beginning from scratch all over again.
Condor currently addresses this issue with the standard universe, which requires users to link their object code with Condor's libraries. Not all applications can be so linked, and a list of limitations can be found here. Furthermore, in a Linux environment the standard universe only works with a limited number of compilers, which is a problem if your favourite compiler is not supported. On top of this, conversations with some of the Condor developers suggest that the standard universe may have a limited future, especially for Linux, due to certain developments within the Linux kernel which are making user-space checkpointing increasingly difficult.
Ideally we'd like to implement a scheme that checkpoints vanilla universe jobs without the need for linking to Condor's libraries, and to achieve that we need two main ingredients:
- An API for checkpointing a running job.
- A user-space file system that will allow us to save a checkpointed image off an execute node.
We'll address these issues below and give an implementation that performs the required actions. As an aside, note that although this page deals exclusively with Linux, BLCR has been reported working with Cygwin and coLinux, so there's hope for those Windows boxes yet. Indeed, the University of Reading employ the method described here under coLinux.
The BLCR kernel modules are modules that one builds for the execute hosts and then loads into the kernel in order to provide checkpointing functionality (note that root privilege is required to perform such an action). It's probably best if I continue by quoting from the BLCR website:
"BLCR (Berkeley Lab Checkpoint/Restart) allows programs running on Linux to be "checkpointed" (written entirely to a file), and then later "restarted"".
"BLCR performs checkpointing and restarting inside the linux kernel. While this makes it less portable than solutions that use user-level libraries, it also means that it has full access to all kernel resources, and can thus restore resources (like process IDs) that user-level libraries cannot. This also allows BLCR to checkpoint/restart groups of processes (such as shell scripts and their subprocesses), together with the pipes that connect them."
There are limitations to the type of application that one can checkpoint using these modules. One should also note that BLCR will work best in a homogeneous environment, where the execute hosts have the same kernel versions. Where this is not the case then it's best if I quote a BLCR developer who I queried on the subject:
"... a restart may fail in strange ways if the kernel was configured with a different feature set. ... However, I would say that among kernels with the same version number and architecture you have a high probability of success."
Keeping in mind these words of warning, we can proceed by downloading the BLCR source code from here and following the installation instructions given in the Administrator's Guide. Alternatively, check to see whether your Linux distribution provides a BLCR package for you, e.g. Ubuntu 9.10 and Debian 6 onwards certainly do. Once installed, we can issue commands via BLCR's API to checkpoint applications executing on a machine. For our purposes these details are not important as they'll be wrapped in the implementation provided below, except to note that the wrapper makes use of a call to BLCR's cr_run, which requires your application to be dynamically linked.
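For orientation, the manual cycle that the wrapper automates can be sketched roughly as follows. This is only an illustration, assuming BLCR's command-line tools (cr_run, cr_checkpoint, cr_restart) are on the PATH; the helper names, the 60-second pause and the context.<pid> file name are illustrative placeholders and are not taken from blcr_wrapper.bash.

```shell
#!/bin/bash
# Sketch only: a manual BLCR checkpoint/restart cycle.
# context_file and blcr_demo are hypothetical helper names.

context_file() {
    # Name of the checkpoint image for a given PID.
    echo "context.$1"
}

blcr_demo() {
    local app=$1 pid
    cr_run "$app" &                 # start the application under BLCR control
    pid=$!
    sleep 60                        # let it do some work
    # Save the process state to a file, then terminate the process.
    cr_checkpoint -f "$(context_file "$pid")" --term "$pid"
    wait "$pid"
    # Resume the application from the saved image.
    cr_restart "$(context_file "$pid")"
}

# Nothing is run automatically; only report whether BLCR is available.
if command -v cr_run >/dev/null 2>&1; then
    echo "BLCR found: try 'blcr_demo ./my_application'"
else
    echo "BLCR user tools not found on this machine"
fi
```

The application passed to blcr_demo must be dynamically linked, for the reason given above.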
Parrot is a user-space filesystem tool (i.e. one that doesn't need superuser privileges to set up and use) which came out of the Condor project. I've already described its use at some length elsewhere on the CamGrid webpages (see here), so I won't dwell on it. One can also find the relevant documentation at the project webpage. The point is that we'll use Parrot to provide us with a distributed filesystem for saving the checkpointed images produced by the BLCR modules off the execute host and onto a checkpoint server of our choosing. If your machines all share a filesystem such as NFS then that's just dandy, but for the general case where that's not available, e.g. as in CamGrid, we'll need to provide our own mechanism and that's the case addressed here. You'll need to download and build Parrot and some other binaries (e.g. chirp_server) from the ccTools download page.
Parrot provides a number of protocols for file transfer but the one we'll use is Parrot's native protocol, chirp. Our wrapper will depend on this. You'll also need to decide on an authentication scheme, and in our example I'll use hostname/address, though you're free to choose; just read the Parrot documentation. Here's a quick example of starting a chirp server on my chosen checkpoint server. This can be set up by a user for their own personal use or by a sysadmin for general use. Either way, no superuser privilege is needed to perform this action.
Suppose I want to export the directory /home/mcal00/data on woolly--escience.grid.private.cam.ac.uk (172.24.116.7) and want to give read and write access to all machines in the domain grid.private.cam.ac.uk. I start by creating the file .__acl in that directory and adding the contents:
You are urged to read the user manual for details of the syntax. Next I'll start a chirp server that exports this directory:
woolly% chirp_server -u - -r /home/mcal00/data -I 172.24.116.7 -p 9096 &
Note that this is just an example. Don't assume that there really is a chirp server running on woolly--escience that you can attach to.
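Once a chirp server is up, Parrot exposes it under a path of the form /chirp/host:port/path. The sketch below uses a small helper to build such paths; chirp_path is a hypothetical name introduced for illustration, and the commented commands assume that parrot_run from ccTools is on the PATH and that the server's ACL admits your machine.

```shell
#!/bin/bash
# Sketch only: building Parrot paths for a chirp server.
# chirp_path is a hypothetical helper, not part of ccTools.

chirp_path() {
    # Usage: chirp_path HOST PORT [REMOTE_PATH]
    local host=$1 port=$2 path=${3:-}
    echo "/chirp/${host}:${port}/${path}"
}

# Example use on a machine with ccTools installed (not run here):
#   parrot_run ls "$(chirp_path woolly--escience.grid.private.cam.ac.uk 9096)"
#   parrot_run cp results.dat \
#       "$(chirp_path woolly--escience.grid.private.cam.ac.uk 9096 results.dat)"
```

Listing the exported directory in this way is a quick check that both the server and its access control are behaving before any jobs are submitted.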
So by now we've got the BLCR modules built and installed on our execute nodes and have a chirp_server running on a machine waiting to receive our checkpointed images. In case not all of the execute machines in our pool(s) have the BLCR modules installed, we can add a classad to machines with BLCR by adding the following to their condor_config.local files:
HAS_BLCR = TRUE
STARTD_ATTRS = $(STARTD_ATTRS), HAS_BLCR
I shall assume that this is the case in what follows. One can use this mechanism to refine resource discovery further in a heterogeneous pool by advertising specific kernel builds, hence giving migrated checkpointed images a better chance of working, or use a more elegant mechanism involving selective requirements for rescheduled jobs, but we won't bother in this example.
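As a rough illustration of advertising the kernel build as well, the following sketch prints the lines one might append to condor_config.local. The attribute name BLCR_KERNEL and the function name are assumptions made up for this example; only HAS_BLCR and STARTD_ATTRS come from the text above.

```shell
#!/bin/bash
# Sketch only: emit config lines advertising BLCR and the kernel release.
# BLCR_KERNEL is a hypothetical attribute name chosen for this example.

blcr_config_lines() {
    # Use the supplied kernel release, or the running kernel's by default.
    local kernel=${1:-$(uname -r)}
    cat <<EOF
HAS_BLCR = TRUE
BLCR_KERNEL = "$kernel"
STARTD_ATTRS = \$(STARTD_ATTRS), HAS_BLCR, BLCR_KERNEL
EOF
}

# Example use on an execute node (not run here):
#   blcr_config_lines >> /path/to/condor_config.local
#   condor_reconfig
#   condor_status -constraint 'HAS_BLCR == TRUE'
```

A job could then add a clause such as BLCR_KERNEL == "<some release>" to its Requirements expression to stay on matching kernels.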
We'll next need to wrap the application we want to submit in a wrapper that will perform the checkpointing for us. I've provided an implementation; save the contents of that link as blcr_wrapper.bash and set its execute bit with chmod +x. The submit script for the job now takes the form (note the similarities with Parallel Universe submit scripts):
##################################
Universe = vanilla
Executable = blcr_wrapper.bash
arguments = < FQDN of chirp-server > < chirp-server port > < checkpoint period > \
            $$([GlobalJobId]) < application > < args >
# NOTE: parrot_run used to be called parrot in previous CCTOOLS releases.
transfer_input_files = parrot_run, < application >, < input files >
transfer_files = ON_EXIT
Requirements = HAS_BLCR == TRUE < && other requirements >
Queue
##################################
Some notes on what the variable quantities should be:
- FQDN of chirp-server: The fully qualified domainname of the machine running the chirp-server.
- chirp-server port: The port the chirp-server's listening on.
- checkpoint period: How often we want checkpoints to be taken (in minutes). Be sensible: at least an hour?
- application: The name of the application you want to run.
- args: The command line arguments for your application.
- input files: The input files used by your application (comma separated).
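To make the argument order concrete, here is a sketch of how a wrapper might unpack these positional arguments. It is not the code of blcr_wrapper.bash; the function and variable names are hypothetical, and the conversion to seconds is simply one convenient choice given that the period is supplied in minutes.

```shell
#!/bin/bash
# Sketch only: unpacking the wrapper's positional arguments.
# All names here are hypothetical; the real wrapper may differ.

parse_wrapper_args() {
    if [ "$#" -lt 5 ]; then
        echo "usage: <chirp host> <chirp port> <period in minutes>" \
             "<job id> <application> [args...]" >&2
        return 1
    fi
    chirp_host=$1
    chirp_port=$2
    period_seconds=$(( $3 * 60 ))   # checkpoint period, minutes to seconds
    job_id=$4                       # expanded by Condor from $$([GlobalJobId])
    shift 4
    app_cmd=("$@")                  # the application followed by its own arguments
}
```

For instance, calling parse_wrapper_args with "woolly--escience.grid.private.cam.ac.uk 9096 60 <job id> my_application A B" leaves period_seconds at 3600 and app_cmd holding my_application, A and B.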
Carrying on from the Parrot example above, I want to submit an application called my_application, which takes the arguments A and B, and also reads input from the files X and Y. Furthermore, I want to run on machines with an X86_64 architecture (only), checkpoint every hour and save stdout, stderr and Condor's log information. The submit script would be:
Universe = vanilla
Executable = blcr_wrapper.bash
arguments = woolly--escience.grid.private.cam.ac.uk 9096 60 $$([GlobalJobId]) my_application A B
transfer_input_files = parrot_run, my_application, X, Y
transfer_files = ALWAYS
Requirements = OpSys == "LINUX" && Arch == "X86_64" && HAS_BLCR == TRUE
Output = test.out
Log = test.log
Error = test.error
Queue
For the record, blcr_wrapper.bash sets off three processes in parallel which have the following roles:
- parrot process: responsible for performing all I/O with the chirp_server.
- job process: constitutes the actual application being run.
- checkpoint process: the parent process and in charge of periodically checkpointing the application.
Interprocess communication is carried out using named pipes, and before you worry unduly about the number of processes, note that all apart from the job process spend most of their time sleeping or blocked on reads.
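The general pattern of a parent that blocks on a named pipe while a job and a timer run alongside it can be sketched as below. This is a toy illustration only: it does not call BLCR or Parrot, the "checkpoint" is just a printed message, and the function name and message format are invented for the example rather than taken from blcr_wrapper.bash.

```shell
#!/bin/bash
# Toy sketch of a parent supervising a job and a timer via a named pipe.
# Not the real wrapper: no BLCR or Parrot calls are made here.

supervise() {
    local job_seconds=$1 period=$2
    local dir fifo msg job_pid timer_pid
    dir=$(mktemp -d) || return 1
    fifo="$dir/events"
    mkfifo "$fifo" || return 1
    # Open read-write so the open does not block and reads never see EOF
    # (this behaviour is relied upon on Linux).
    exec 3<>"$fifo"

    # Job process: stands in for the real application.
    ( sleep "$job_seconds"; echo done >&3 ) &
    job_pid=$!

    # Timer process: wakes the parent once per checkpoint period.
    ( while sleep "$period"; do echo tick >&3; done ) &
    timer_pid=$!

    # Parent: blocked on the pipe, so it uses no CPU while waiting.
    while read -r msg <&3; do
        case $msg in
            tick) echo "checkpoint requested" ;;
            done) echo "job finished"; break ;;
        esac
    done

    kill "$timer_pid" 2>/dev/null
    wait "$job_pid" "$timer_pid" 2>/dev/null
    exec 3>&-
    rm -rf "$dir"
}
```

Running supervise 0.5 0.1 prints a handful of "checkpoint requested" lines followed by "job finished"; the exact number of checkpoint lines depends on timing.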
This approach is not without its limitations:
- As noted above, blcr_wrapper.bash currently expects your application to be dynamically linked. If you need to run statically linked applications then you could try linking with:
- gcc -L
- -lcr -o a.out ./test.o
where I've assumed I've only got one object file, test.o, in this example.
- Since the BLCR modules require constant full file names/paths, which Condor's scratch directory naming system cannot provide, we operate in /tmp. This is usually OK, except that if a job dies unexpectedly then it may leave garbage in /tmp and on the chirp_server that Condor can't clean up. Using condor_rm to kill your job can lead to this condition since this gives the job an untrappable SIGKILL. Try using condor_vacate_job first since this delivers a SIGTERM.
- On receiving SIGTERM, blcr_wrapper.bash will automatically checkpoint the application before exiting. Such a scenario could happen when a job is being preempted. This means that if you set kill_sig to a different signal in the submit script then no checkpointing prior to exit will be performed.
- The directory containing the BLCR library libcr.so must be listed in the LD_LIBRARY_PATH environment variable on the execute machine. By default this library is placed in /usr/local/lib, but if this is not the case on your machines then edit the value accordingly in blcr_wrapper.bash.
- The parent process uses/receives SIGUSR1 and SIGUSR2 for internal communications.
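The SIGTERM behaviour described in the list above relies on the shell's ability to trap that signal. A minimal sketch of the idea follows, with the checkpoint replaced by a printed message; the function names are hypothetical, the real wrapper's handler is not reproduced here, and BASHPID assumes bash 4 or later.

```shell
#!/bin/bash
# Sketch only: trapping SIGTERM to act before exiting.
# Function names are hypothetical; requires bash 4+ for BASHPID.

final_checkpoint() {
    # A real wrapper would checkpoint the job at this point.
    echo "SIGTERM caught: checkpointing before exit"
    exit 0
}

sigterm_demo() (
    # Runs in a subshell so the signal only affects this demo.
    trap final_checkpoint TERM    # SIGKILL, by contrast, can never be trapped
    kill -TERM "$BASHPID"         # simulate the signal the wrapper would receive
    echo "not reached"
)
```

Calling sigterm_demo prints the handler's message and nothing else, which is why a SIGKILL, or a kill_sig set to some other signal, bypasses the final checkpoint.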
Thanks to Doug Thain, Paul Harford, Dan Bradley and José Martin for helpful correspondence/discussions.
Please send any comments or feedback to the email below. If you find this technique useful then I'd be grateful if you'd also email me (mc321 at cam.ac.uk), just to let me know.