UM_Bristol

Documentation for clustersubmit

Introduction

clustersubmit is the main script that we use to submit UM4.5 (HadCM3, FAMOUS etc.) jobs to our HPC machines. It evolved from umsubmit, a script distributed with the original code from the Met Office. It has been modified and extended many times and is now a complex piece of code. It has occasionally had major clean-ups, but later additions have left it a rather untidy and only semi-structured script.

The umui database stores all of the information about a simulation. When the process button is pressed, the umui creates a series of files in the $HOME/umui_jobs/expt_name folder which contain all of the information needed to run the model. These files need to be copied to the HPC. clustersubmit then takes these files, makes a unique copy in the umui_runs folder (changing some parameters), and submits the UM main script to run on the machine. clustersubmit has gradually become tied to the BRIDGE setup and is not easily transferable to other machines and setups.

One of the key roles of clustersubmit is to allow the job in the umui to be independent of the target HPC machine. This is achieved by keeping to some common naming conventions within the umui. In the umui, folder and file locations are referred to using standard location names (such as /home/swsvalde or /home/username). clustersubmit then converts these to the appropriate local folder name on the relevant HPC machine.

A second role of clustersubmit is to swap between compile jobs, new run jobs, and continuation runs without the need to make any changes in the umui itself. In general, this makes life a lot simpler and means that the copy of the job within the umui can always be left as the compile option.

A third role of clustersubmit is to change the run time and number of processors, depending on the machine and queue being used. Again, this avoids the need to change these within the umui itself.

clustersubmit has also evolved to do a few other things.

Example usage

(a) Compile job:

clustersubmit -c y -s y -q cpu -r bc4 -P abcdef expt_name

(b) New simulation:

clustersubmit -c n -s y -a y -q veryshort -r bc4 -P abcdef -p 7x4 expt_name

(c) Continuation run:

clustersubmit -c y -s y -a y -q compute -r bp14 -P abcdef -p 6x4 expt_name
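
The argument list below also defines three shorthand options (-C, -A and -S) that stand for the flag combinations used in these examples. As a rough illustration only, the mapping can be written as a small shell function. The name expand_shorthand is hypothetical and is not part of clustersubmit itself.

```shell
# Hypothetical helper (not part of clustersubmit): print the long-form flags
# that the argument list gives for each shorthand option.
expand_shorthand() {
    case "$1" in
        -C) echo "-c y -s y" ;;        # compile job
        -A) echo "-c n -s y -a y" ;;   # new run, after compilation
        -S) echo "-c y -s y -a y" ;;   # continuation run
        *)  echo "unknown shorthand: $1" >&2; return 1 ;;
    esac
}

expand_shorthand -A
```

Going by the argument list, example (a) could then presumably be shortened to clustersubmit -C y -q cpu -r bc4 -P abcdef expt_name, although this has not been checked against the script itself.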

Configuration File

It is sometimes useful to set some standard defaults for the machine being used. This can be done by creating a file called: $HOME/.um/clustersubmit.conf.

Typical options for a bluepebble configuration are:

cores_ns="6"
cores_ew="4"
nomail="y"
notransfer="y"
rhost="bp14"
queue="compute"
account="GEOG015942"

Typical options for a bluecrystal configuration are:

cores_ns="7"
cores_ew="4"
nomail="y"
notransfer="y"
rhost="bc4"
queue="cpu"
account="GEOG015942"
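
How clustersubmit reads this file is not described here. Because the entries look like shell variable assignments, one plausible approach is to source the file so that its values replace built-in defaults. The sketch below assumes that behaviour; the function name load_defaults is hypothetical, and a temporary file stands in for the real configuration.

```shell
# Hypothetical sketch: load defaults from a clustersubmit.conf-style file.
# Assumes the file holds plain shell assignments such as rhost="bp14".
load_defaults() {
    conf_file="${1:-$HOME/.um/clustersubmit.conf}"
    # Fallback values used when the file is missing or omits a setting.
    rhost=""
    queue=""
    account=""
    if [ -r "$conf_file" ]; then
        . "$conf_file"
    fi
}

# Demonstration with a temporary file rather than the real configuration.
demo_conf=$(mktemp)
printf 'rhost="bp14"\nqueue="compute"\naccount="GEOG015942"\n' > "$demo_conf"
load_defaults "$demo_conf"
echo "rhost=$rhost queue=$queue account=$account"
rm -f "$demo_conf"
```

In a scheme like this, options given on the command line would be applied after the file is read, so that they take precedence over the stored defaults.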

Argument List

Name	Description	Valid Options	Default
-expt	Experiment name	Any	None. Compulsory
-r	The remote machine that will run the simulation	A number of short names for clusters. See note 1 for full list.	None. Compulsory
-q	The queue to submit the job to. The name will depend on the machine. The queue name will also often impact on the default run time.	See note 2 for a list of available queues	None. Compulsory
-P	Account name	Adds user account for job	Currently needed for bluecrystal and bluepebble. Can be set as a default using the file $HOME/.um/clustersubmit.conf
-p	Number of cores, in the format NSxEW	Any, but the total core number must equal the size of a node (currently 28 cores on bluecrystalp4, 24 cores on bluepebble)	Optional. If not specified, the value is taken from the umui job. Defaults can be set in the file $HOME/.um/clustersubmit.conf
-g	As -p but without any check on total core number.
-w	Wall time	Format is 24:0:0 (i.e. 24 hrs, 0 min, 0 secs) or 14-0:0:0 (i.e. 14 days)	Optional. If not specified, the maximum for the specified queue is used.
-u	Username	Any	Optional: default is the same user as submitter
-F	Force submission	Clustersubmit will check to see if there is already a job running with the same name	Optional: Default=n, option=y
-n	Email output	Umui jobs can email outputs to the user. This can override that setting	Optional: Default: same as in umui, option=y
-e	Create ensemble	It is possible to merge a number of umui jobs into a single multi-node job. This can sometimes be more	Optional: Default=n, option=y
-S	Shorthand for the best continuation run	Should be used for a continuation run, after compilation	Optional: Default=n, option=y If y, then the same as: -c y -s y -a y
-C	Shorthand for the best settings for a compile job	Should be used for compile job only	Optional: Default=n, option=y If y, then the same as: -c y -s y
-A	Shorthand for new run, after compilation	Should be used for a new run, after compilation	Optional: Default=n, option=y If y, then the same as: -c n -s y -a y
-a	Change compilation	This changes the job from a compile job to non-compile or vice versa.	Optional: Default=n (no change) option=y (change setting)
-c	New or continuation run	Changes settings to make it a new run, or continuation run	Optional: Default=n (new run) option=y (continuation run)
-l	Write data to local disk	On some machines, the nodes had local disks which were a lot faster than networked disks.	Optional: Default=n The option=y is currently not used because none of the machines has a local disk.
-m	Rerun the reconfiguration	This is an unusual option. See Note 3 for full explanation	Optional: Default=n, option=y
-s	Submit job to queue	This is the normal option but could just create the umui_runs folder	Optional: Default=y, option=n
-f	Copy umui_job files to HPC machine	We used to submit jobs from puma and it was useful to copy the umui_job.	No longer used. We can no longer submit jobs from puma due to security issues.
-d	Debug	This prints additional information to help debugging	Optional: Default=n, option=y
-v	Verbose	More output	Optional: Default=n, option=y
-V or -h or -?	Version	Prints some information about the script	Optional: No inputs
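
The difference between -p and -g can be illustrated with a short shell sketch. It parses an NSxEW string and, as -p is described as doing, checks that the product equals the node size (28 cores on bluecrystalp4, 24 on bluepebble). The function name check_decomp is hypothetical, and the real script may perform the check differently.

```shell
# Hypothetical sketch of the -p check: NS x EW must equal the cores per node.
# Usage: check_decomp NSxEW cores_per_node
check_decomp() {
    # Accept only digits around a single "x", e.g. 7x4.
    case "$1" in
        *[!0-9x]*|x*|*x|*x*x*) echo "bad format: $1" >&2; return 2 ;;
        *x*) ;;
        *) echo "bad format: $1" >&2; return 2 ;;
    esac
    ns=${1%x*}   # part before the x
    ew=${1#*x}   # part after the x
    total=$((ns * ew))
    if [ "$total" -ne "$2" ]; then
        echo "$1 uses $total cores, but a node has $2" >&2
        return 1
    fi
    echo "$1 OK ($total cores)"
}

check_decomp 7x4 28   # bluecrystalp4 decomposition from the usage examples
check_decomp 6x4 24   # bluepebble decomposition from the usage examples
```

Under this reading, -g would simply skip the comparison against the node size and accept any well-formed NSxEW value.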

Note 1: HPC Machine Names

Clustersubmit name	Full Machine Name	Description
CURRENT MACHINES
bc4	bc4login.acrc.bris.ac.uk	Latest, fourth bluecrystal machine
bc4login1 bc4login2 bc4login3 bc4login4	bc4login1.acrc.bris.ac.uk bc4login2.acrc.bris.ac.uk bc4login3.acrc.bris.ac.uk bc4login4.acrc.bris.ac.uk	Submitting to bc4 submits to a general pool of 4 machines. Occasionally, there are problems with this system, so it is useful to be able to submit to a specific machine.
bp1	bp1-login01.acrc.bris.ac.uk	Bluepebble HPC machine, using the generic head node
bp14	bp1-login04.acrc.bris.ac.uk	Bluepebble HPC machine, using the BRIDGE head node

LEGACY MACHINES
quest-hpc quest	quest-hpc.bris.ac.uk	Old quest machine, which was a small cluster
Ormen	ormen.ggy.bris.ac.uk	Small predecessor of quest machine
babyblue bc bluecrystal bluecrystalp1	bluecrystalp1.bris.ac.uk	The first bluecrystal machine
bigblue bigbluecrystal bluecrystalp2	bluecrystalp2.bris.ac.uk	The second bluecrystal machine
newblue bluecrystalp3	bluecrystalp2.acrc.bris.ac.uk	The third bluecrystal machine
newblue1 newblue2 newblue3 newblue4	newblue1.acrc.bris.ac.uk newblue2.acrc.bris.ac.uk newblue3.acrc.bris.ac.uk newblue4.acrc.bris.ac.uk	Submitting to newblue would submit to a general pool of 4 machines. Occasionally, there were problems with this system, so it was useful to be able to submit to a specific machine.
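
As an illustration of how the short names in this table relate to full machine names, the lookup for the current machines can be sketched as a shell case statement. The function name resolve_host is hypothetical, and the legacy machines are left out for brevity.

```shell
# Hypothetical sketch: map a clustersubmit short name to a full machine name,
# using the "current machines" rows of the table above.
resolve_host() {
    case "$1" in
        bc4)           echo "bc4login.acrc.bris.ac.uk" ;;
        bc4login[1-4]) echo "$1.acrc.bris.ac.uk" ;;
        bp1)           echo "bp1-login01.acrc.bris.ac.uk" ;;
        bp14)          echo "bp1-login04.acrc.bris.ac.uk" ;;
        *)             echo "unknown machine: $1" >&2; return 1 ;;
    esac
}

resolve_host bp14
```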

Note 2: Queue Names

On bluecrystalp4:

Queue Name	Characteristics
cpu	Maximum 14 days run time
test	Maximum 1 hour run time
veryshort	Maximum 6 hour run time
bridge	Maximum 14 days run time. Limited number of nodes available. Available to selected users only.
paleo	Maximum 14 days run time. Limited number of nodes available. Available to selected users only.

On bluepebble:

Queue Name	Characteristics
compute	Maximum 14 days run time
test	Maximum 1 hour run time
short	Maximum 3 days run time
djl	Maximum 14 days run time. Limited number of nodes available. Available to selected users only.
dmm	Maximum 14 days run time. Limited number of nodes available. Available to selected users only.
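
Since the -w option defaults to the maximum for the chosen queue, the two tables above can be read as a lookup from queue to default wall time. The sketch below expresses that lookup in the wall time format used by -w. The function name max_walltime and the machine labels bc4 and bp are hypothetical.

```shell
# Hypothetical sketch: default wall time = queue maximum from the tables above.
# Usage: max_walltime machine queue   (machine is bc4 or bp)
max_walltime() {
    case "$1:$2" in
        bc4:cpu|bc4:bridge|bc4:paleo) echo "14-0:0:0" ;;  # 14 days
        bc4:veryshort)                echo "6:0:0" ;;     # 6 hours
        bc4:test|bp:test)             echo "1:0:0" ;;     # 1 hour
        bp:compute|bp:djl|bp:dmm)     echo "14-0:0:0" ;;  # 14 days
        bp:short)                     echo "3-0:0:0" ;;   # 3 days
        *) echo "unknown queue $2 on $1" >&2; return 1 ;;
    esac
}

# Use the -w value if one was given, otherwise fall back to the queue maximum.
walltime=""
queue="veryshort"
if [ -z "$walltime" ]; then
    walltime=$(max_walltime bc4 "$queue")
fi
echo "$walltime"
```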

Note 3: The -m option

When starting a new model simulation, we need to reconfigure the input dump files before the model itself starts. This step takes the input dump files and writes out new dump files, typically called $DATAW/$RUNID.astart and $DATAW/$RUNID.ostart. On some occasions, the atmosphere dump reconfiguration crashes. This mostly happens when changing the land-sea mask but can happen on other occasions too. The output dump file ($DATAW/$RUNID.astart) is created but is not complete. Sometimes the reconfiguration will work if you re-run it starting from the incomplete output dump file ($DATAW/$RUNID.astart) and writing out a new dump file (e.g. $DATAW/$RUNID.astart1). If you select the -m y option, this is all done automatically.

Note 4: Path translation

Clustersubmit will take folder names from the umui job, and translate a few standard folder names to the specific locations on the relevant machines. This allows you to use universal folder names in the umui and these do not have to be changed depending on the target machine. Currently the folders which are automatically translated are:

Folder in umui	Folder on bluecrystal	Folder on bluepebble
/home/swsvalde	/mnt/storage/private/bridge/swsvalde	/bp1/geog-tropical/users/swsvalde
/home/username *	/user/name/username	/user/home/username
~username1 *	/user/name/username1	/user/name/username1

* Where username is the name of the user running the simulation and username1 is the name of the owner of the folder.
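
The translation table can be sketched as a small shell function that rewrites the leading part of a path and leaves everything else alone. This is only an illustration of the rules listed above, not the code used in clustersubmit. The function name translate_path and the user names alice and bob are hypothetical.

```shell
# Hypothetical sketch of the translation table above.
# Usage: translate_path machine user path   (machine: bluecrystal|bluepebble)
translate_path() {
    machine=$1
    user=$2
    in_path=$3
    case "$machine" in
        bluecrystal) shared=/mnt/storage/private/bridge/swsvalde
                     home_root=/user/name ;;
        bluepebble)  shared=/bp1/geog-tropical/users/swsvalde
                     home_root=/user/home ;;
        *) echo "unknown machine: $machine" >&2; return 1 ;;
    esac
    case "$in_path" in
        # Shared BRIDGE area.
        /home/swsvalde|/home/swsvalde/*)
            echo "$shared${in_path#/home/swsvalde}" ;;
        # Home folder of the user running the simulation.
        /home/"$user"|/home/"$user"/*)
            echo "$home_root/$user${in_path#/home/"$user"}" ;;
        # ~owner form; the table gives the same target on both machines.
        "~"*)
            echo "/user/name/${in_path#"~"}" ;;
        # Anything else is passed through unchanged.
        *)
            echo "$in_path" ;;
    esac
}

translate_path bluepebble alice /home/swsvalde/ancil
translate_path bluecrystal alice "~bob/data"
```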