
Technical Details

Introduction

CamGrid is made up of a number of HTCondor pools belonging to like-minded departments and groups (see the list here for participating groups), which share their resources by using HTCondor's flocking mechanism. Each group decides what level of access it provides to its pool(s) and how to configure its machines. Such a federated approach means that there is no single point of failure in this environment, and the grid does not depend on any individual pool to continue working. Currently most machines are modern dedicated servers running in 64-bit mode, though there are also some desktops of various vintages. CamGrid is practically all Linux (various distributions), with hundreds of cores/processors on the grid; this isn't by design, but simply reflects what participating groups have chosen to contribute. New groups are always welcome to join, initially by contacting Mark Calleja.

New to CamGrid?

Introductory notes for users and sysadmins new to CamGrid can be found behind these links.

What's expected of participating pools?

Individual groups and departments are asked first to set up their own isolated HTCondor pool, which they can use to become conversant with the technology. Only after local users and administrators have brought their pool to a robust state should they ask to join CamGrid, as this greatly simplifies any residual troubleshooting. A full list of requirements is given here. The latest release of HTCondor's 8.2 stable branch should currently be used, and upgrade notes for sysadmins are given here.
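
As an illustration only (the page linked above gives the full requirements), the configuration of a small, self-contained departmental pool might look something like the sketch below; the hostnames are hypothetical:

    ## Central manager of an isolated departmental pool (hypothetical hostname)
    CONDOR_HOST = condor-cm.dept.cam.ac.uk
    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

    ## Restrict write access to the department's own machines
    ALLOW_WRITE = *.dept.cam.ac.uk

    ## Execute-only nodes would instead run just:
    # DAEMON_LIST = MASTER, STARTD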

The Architecture

A simple graphical representation of CamGrid's flocked architecture is given here.

Implementation and configuration

Implementation details and an example of how to modify the HTCondor configuration for a machine on CamGrid can be found here. Note that some of the configuration changes mentioned in this section are necessary for machines on CamGrid, so please read carefully (this page is of particular relevance to sysadmins).
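
The key mechanism is HTCondor's flocking configuration. A minimal sketch of the knobs involved, with placeholder hostnames rather than real CamGrid machines, is given below; the linked page has the authoritative settings:

    ## On submit machines: central managers of remote CamGrid pools to flock to
    FLOCK_TO = camgrid-cm.other-dept.cam.ac.uk, camgrid-cm.third-dept.cam.ac.uk

    ## On the central manager and execute nodes: remote submit machines allowed to flock in
    FLOCK_FROM = submit.other-dept.cam.ac.uk, submit.third-dept.cam.ac.uk
    ALLOW_WRITE = $(ALLOW_WRITE), $(FLOCK_FROM)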

CamGrid Monitoring Tools

A suite of tools has been provided for monitoring the status of the flocked pools in CamGrid and the configuration values of individual nodes. One can also find monthly CPU usage figures as well as details of individual pools. In addition, a web interface allows users to monitor all files being produced by their vanilla and parallel/MPI jobs on the fly. Find these resources here.
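
These web tools complement HTCondor's standard command-line queries, which can be run from any submit machine; for example (the remote pool hostname below is a placeholder):

    # Summarise the slots in your local pool
    condor_status -total

    # Query the collector of another (flocked-to) CamGrid pool
    condor_status -pool camgrid-cm.other-dept.cam.ac.uk -total

    # Show your own jobs, and where the running ones have landed
    condor_q
    condor_q -run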

CamGrid and MPI

It is possible to run MPI jobs under HTCondor, though this requires extra configuration by sysadmins. Notes on how to configure MPI facilities within CamGrid, and how to submit jobs to them, are given here.
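
For orientation, a parallel universe submit file generally looks something like the sketch below; the wrapper script (such as the openmpiscript example shipped with HTCondor) and other details are site-specific, so treat the names as placeholders and follow the linked notes for CamGrid itself:

    universe      = parallel
    executable    = openmpiscript        # wrapper that launches mpirun; site-specific
    arguments     = my_mpi_program
    machine_count = 8

    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = my_mpi_program

    output = mpi_$(Node).out
    error  = mpi_$(Node).err
    log    = mpi.log

    queue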

Accessing data across CamGrid using Parrot

Data handling in HTCondor's vanilla universe can be cumbersome, especially if one would like to export a directory structure to an execute node that does not share a file system with the submit host. Similarly, output files produced in the vanilla universe stay on the execute node until the job completes, which can be inconvenient if one needs to see these files while the job is in progress. One solution to both of these problems is to use Parrot, which provides a user-space (i.e. unprivileged) approach. Details of this method, including an example, are given here.
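
As a rough sketch of the idea only (the linked page gives the authoritative recipe, including the necessary access control), one might serve a directory with a Chirp server on the submit host and then read it from the execute node through Parrot's /chirp/ namespace; hostnames and paths below are illustrative:

    # On the submit host: export a directory via a Chirp server
    chirp_server -r /home/me/data

    # On the execute node: run the application under Parrot, entirely in user space,
    # so it sees the exported directory under the /chirp/ namespace
    parrot_run ./my_app --input /chirp/submit-host.dept.cam.ac.uk/input.dat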

Checkpointing Vanilla universe jobs using BLCR kernel modules

It is not always possible to use HTCondor's Standard universe to checkpoint applications, and in any case that universe's long-term future is in doubt. However, it is possible to checkpoint Vanilla universe jobs transparently by using the BLCR kernel modules together with Parrot. Details of this approach can be found here.
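
For context, the raw BLCR user commands work along these lines (file names are illustrative); the CamGrid recipe on the linked page wraps them up with Parrot and HTCondor so that the checkpointing happens transparently:

    # Start the application under BLCR's checkpoint library
    cr_run ./my_app &

    # Later, write a checkpoint of the running process
    cr_checkpoint -f my_app.ckpt $!

    # ...and resume it, possibly on another machine with BLCR installed
    cr_restart my_app.ckpt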

Application level checkpointing

HTCondor's Standard universe is great for performing automatic checkpoints of running jobs, but many applications cannot run in this universe due to library linking issues. In this case, one can either try the BLCR approach mentioned above or an application-level recursive approach.
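
Independently of the specific recursive scheme described behind the link, the general idea in the vanilla universe is for the application itself to write its state to a file periodically and to resume from that file on restart; a minimal submit-file sketch (file names are illustrative) is:

    universe   = vanilla
    executable = my_app              # must periodically save, and restart from, its own state file
    arguments  = --state state.dat

    should_transfer_files   = YES
    # Transfer intermediate files back on eviction, so the job can resume from its
    # last self-written checkpoint when it next starts
    when_to_transfer_output = ON_EXIT_OR_EVICT

    output = app.out
    error  = app.err
    log    = app.log

    queue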

Vanilla/Parallel universe file viewer

HTCondor's vanilla universe is the simplest one to use, but comes at the cost of not being able to see files being generated on an execute machine at run time. A similar problem occurs with parallel universe (MPI) jobs. To circumvent this problem I have developed a web-based utility that allows users to view their files anywhere on CamGrid. Details of the mechanism are given here.

PyDAG: A graphical DAGMan builder

HTCondor's DAGMan workflow tool is a useful and powerful utility, but generating its myriad scripts can be tedious for all but the simplest workflows. PyDAG is a simple GUI I've written to help circumvent the problem.
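
For context, the kind of DAG description file that such a tool helps to generate looks like the sketch below (node and submit-file names are illustrative); it is then submitted with condor_submit_dag:

    # diamond.dag: a four-node workflow
    JOB  A  prepare.sub
    JOB  B  left.sub
    JOB  C  right.sub
    JOB  D  collect.sub
    PARENT A   CHILD B C
    PARENT B C CHILD D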

CamGrid and PBS

Notes on how PBS resources, with or without MPI functionality, can be integrated into and used within CamGrid are given here.
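
One route HTCondor offers for this is its grid universe, where a submit file along the following lines hands a job to a PBS server; whether and how CamGrid uses this route is described in the linked notes, so treat this purely as a sketch:

    universe      = grid
    grid_resource = pbs
    executable    = my_job

    output = job.out
    error  = job.err
    log    = job.log

    queue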

External Access

CamGrid is meant for members of the University of Cambridge. However, university members may wish to allow external collaborators to use this grid, e.g. via a Globus interface, and the following procedure has been agreed upon by the current stakeholders to facilitate such activity: the administrator of the pool from which these users are to submit jobs will notify all other pool administrators of the usernames and submit machines involved. This notification can be achieved by emailing the ucam-camgrid-admins mailing list, and the relevant level of filtering may then be applied by the other pools.