Batch working

Condor 8.6.11 is installed on our Linux systems to allow batch work. If you are running a task requiring more than a few minutes of CPU, you should use the batch system, as this attempts to ensure that such tasks run at an appropriate priority so as not to adversely affect the main user of the system.

We have a single Condor batch system, with a mix of desktops and dedicated CPU servers. The POOL variable (see below for more details) is intended to allow users to choose which class of machine their job is submitted to. Users are encouraged to use the dedicated servers in preference whenever possible. Note that the dedicated servers don't have direct access to locations outside Cambridge, which might cause difficulties with certain types of jobs (for example, AFS access is imperfect). If this causes difficulties for you, please talk to the management team about your needs.

To use Condor, you need to create a job description file containing instructions to the batch system. A trivial example follows:

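# Script or program to run
Executable = a.job
Universe = vanilla
# Restrict the job to machines matching these conditions
Requirements = (POOL == "GENERAL") && (OSTYPE == "SLC6") && ((Arch == "X86_64") || (Arch == "INTEL"))
# Among matching machines, prefer those with more memory
Rank = Memory
request_memory = 500 MB
output = a.output
error = a.error
Log = condor.log
copy_to_spool = true
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
# Submit one instance of the job
Queue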

The "Requirements" field allows you to control the destination of the job. POOL and OSTYPE are private parameters added by us to help with job management. POOL controls whether the job is forced to run on desktops ("GENERAL") or on dedicated CPU servers ("GEN_FARM"); if it is unset, Condor decides. OSTYPE allows you to choose the SLC6 or CC7 systems if you need a specific operating system version. Arch specifies the architecture on which the job can run, and is needed if the job is to run on an architecture different from that of the submit host. "INTEL" indicates 32-bit and "X86_64" 64-bit operating systems, though note that all systems are now 64-bit; the Arch clause is retained in the example merely to highlight the syntax of the "Requirements" field.
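
As an illustration of the syntax, a job that should run only on the dedicated CPU servers under CC7 might use a line such as the following. This is a sketch built from the POOL and OSTYPE values described above; adjust it to your own needs.

```
# Run only on the dedicated CPU servers, and only on CC7 machines
Requirements = (POOL == "GEN_FARM") && (OSTYPE == "CC7")
```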

The "Rank" field allows you to encourage the job towards certain machines (though it won't guarantee a specific machine - use the "Requirements" field if you want to run on a specific machine). The example will tend to use machines with more memory - you can use "Mips" or "kflops" to get machines with faster CPUs.
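
For instance, the lines below sketch a preference for faster machines and, separately, a way to insist on one named machine. KFlops and Machine are standard Condor machine attributes; the host name shown is a hypothetical placeholder, not a real machine in our pool.

```
# Prefer machines with better floating-point performance
Rank = KFlops

# To insist on one particular machine, use Requirements instead
# (hypothetical host name)
Requirements = (Machine == "myhost.example.org")
```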

If you know that your jobs will run for several days and you are content for them to be suspended to allow shorter jobs to run, then you can add the option

+IsSuspendableJob = True

to the Condor job description file. This option is only available on the dedicated CPU servers, where extra job slots have been created to implement this option. Note that the total number of jobs which can run at the same time has not increased.
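
Putting this together, a long-running job might combine the option with a request for the dedicated servers. This is only a sketch: the file names beginning with "long" are hypothetical placeholders, and the remaining lines follow the trivial example above.

```
# Hypothetical long-running job script
Executable = long.job
Universe = vanilla
# Suspendable slots exist only on the dedicated CPU servers
Requirements = (POOL == "GEN_FARM")
# Allow this job to be suspended so that shorter jobs can run
+IsSuspendableJob = True
output = long.output
error = long.error
Log = condor.log
Queue
```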

One problem that occurs quite frequently is that your job goes into the hold state with the status 'Exec format error'. Usually this means that the script you are trying to execute does not have an interpreter line as its first line, for example:

#!/bin/sh

(or similar). You must tell Condor which environment your script is running under.

There are more details on the submit script in the relevant section of the manual.
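Some useful commands include condor_submit (submit a job description file), condor_q (show the state of queued jobs), condor_rm (remove a job from the queue) and condor_status (show the state of the machines in the pool).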

The Condor manual is available in HTML or PDF form. Be warned - it is over 1000 pages long and contains a vast amount of detail which the average user does not need to know.

You can view the current status of the Condor system. Statistical information is also collected.

Steve Wotton and John Hill
Last update 8 June 2018.