A Vanilla/Parallel universe file viewer

Condor's vanilla universe provides the simplest mechanism for job submission, since it does not require the application to be linked against Condor's libraries as the standard universe does, but this simplicity comes at a price. Apart from being unable to checkpoint these jobs, one also has no simple way of monitoring the output being generated by these jobs on the fly unless a shared file system is used. However, shared file systems are not always feasible, e.g. in CamGrid's multi-administrative domain environment. This can be a real shortcoming for long-running jobs, since the application may be wasting its time (e.g. on a non-converging optimisation) without the user being aware. Similar problems arise with parallel universe (usually MPI) jobs where a shared file system is not being used.

A web based file viewer

In order to circumvent the issues mentioned above, I have produced a web-based file viewer for CamGrid users. This can be accessed here, and a password for each user can be obtained from me (Mark Calleja, email: mc321 at cam.ac.uk). It works by having an agent run on each execute host that listens for incoming requests from the UCS' dedicated webserver, with all requests proxied via this server. On entering their username and password, a session is started for the user and a cookie is loaded into their browser. This initial transaction takes place over HTTPS, and the agent will only answer requests emanating from the UCS webserver.

The architecture

The basic architecture is embodied in the figure below. Each execute node in the grid runs an agent (in fact a Perl daemon), called a slave listener, that listens for requests from a central webserver running a number of CGI scripts. The interaction starts with the user logging onto the webserver (which uses HTTPS and cookies for session information), which in turn interrogates a single master listener about the user's jobs in the grid. These can span a number of different flocked pools, so the master listener effectively wraps the necessary Condor commands to do the discovery. This information is passed back to the user via the webserver (hence all traffic is proxied via the webserver), who is presented with a listing of the machines their jobs are running on, with each instance given as a link. Clicking on a link activates another CGI script, which interrogates the relevant agent on that execute node; the agent sends back a listing of the scratch directory for that job (together with file sizes and mtimes), again presented as a list of links. Clicking on any link (or right-clicking and saving) presents the user with that file's contents.

[Figure: the file viewer architecture]
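To make the slave listener's role concrete, here is a minimal sketch of such an agent in Perl. The real slave_listener.pl (see the installation section below) is the authoritative version; the one-line request format, the hard-coded webserver address check and the reply layout here are illustrative assumptions only, and daemonising is omitted.

 #!/usr/bin/perl
 # Minimal slave listener sketch: answer directory-listing requests,
 # but only from the UCS webserver.
 use strict;
 use warnings;
 use IO::Socket::INET;

 my $port      = $ARGV[0] || 6061;    # port to listen on
 my $webserver = '172.24.189.69';     # CamGrid address of the UCS webserver

 my $server = IO::Socket::INET->new(
     LocalPort => $port,
     Proto     => 'tcp',
     Listen    => 5,
     ReuseAddr => 1,
 ) or die "Cannot listen on port $port: $!\n";

 while ( my $client = $server->accept ) {
     # Reject anything that does not originate from the UCS webserver.
     if ( $client->peerhost ne $webserver ) {
         close $client;
         next;
     }
     chomp( my $dir = <$client> );    # hypothetical request: a job's scratch dir
     if ( opendir my $dh, $dir ) {
         for my $f ( grep { $_ ne '.' and $_ ne '..' } readdir $dh ) {
             my @st = stat "$dir/$f";
             # Reply with name, size (bytes) and mtime, as in the listing above.
             printf $client "%s %d %d\n", $f, $st[7] || 0, $st[9] || 0;
         }
         closedir $dh;
     }
     else {
         print $client "ERROR: cannot open $dir: $!\n";
     }
     close $client;
 }

The CGI script on the webserver would then turn each line of such a reply into one of the links mentioned above.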

Installing the execute host agent

The agent is written in Perl and can be downloaded from here. Rename the file to slave_listener.pl, make sure that you have set the environment variable CONDOR_HOME to point at the Condor distribution on that execute host, and, after satisfying yourself that the script doesn't do anything malicious, start it as an unprivileged user (user "nobody" in this example) in daemon mode with:

 su -c "/path/to/slave_listener.pl 6061" -s /bin/bash - nobody < /dev/null >/dev/null 2>&1

The agent needs to listen for TCP requests on port 6061 from the CamGrid address of the UCS' webserver only, which is 172.24.189.69, so you can tighten any firewall/iptables rules accordingly. The agent daemon will in any case reject any requests it doesn't think originate from the UCS webserver. Also, the user you run it as must have read and execute permissions on the execute directory that Condor uses. Note that you may need to install extra Perl modules to provide the required functionality, e.g. the URI.pm module; a quick check is shown below.
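As a quick sanity check that the required modules are present before starting the daemon, you can try loading them from the command line. URI.pm is the one named above; extend the list with whatever else the script imports:

 perl -MURI -MIO::Socket::INET -e 'print "modules found\n"'

A missing module shows up as the usual "Can't locate ... in @INC" error, in which case install it from your distribution's packages or CPAN.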

Possible error messages

You may get the following error message when attempting to connect to a machine running a job of yours:

Software error:
Couldn't connect to <hostname>:6061 : IO::Socket::INET: connect: Connection refused

This is probably because the Computer Officer in charge of that machine has not bothered to activate the listening daemon on it, in which case please contact him/her directly to ask why.
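If you can run commands from a host the daemon will answer (recall it only talks to the UCS webserver), a one-liner such as the following confirms whether anything is listening on port 6061 at all; the hostname is a placeholder:

 perl -MIO::Socket::INET -e 'IO::Socket::INET->new(PeerAddr => "<hostname>:6061", Timeout => 5) ? print "listener up\n" : die "no listener: $@\n"'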

Limitations

This agent implementation currently only works under *nix. I did write a version for Windows some time back, but since there doesn't seem to be much demand for such resources on CamGrid, I've tended to neglect its development of late. If anyone feels a burning need for such a facility then I may be persuaded to resurrect it.