Center for Piezoelectrics by Design web page

Using the SLURM batch environment

The Opteron and Itanium clusters use the Simple Linux Utility for Resource Management (SLURM) to control all jobs that run on the compute nodes. The maximum wallclock run time is 5 days. All jobs are scheduled using Maui and its fairshare algorithm (which attempts to give all users an equal share of the nodes over time). By default, users are given one core/processor per node, although an entire node (all four cores) can be requested via the batch script.

Useful SLURM commands and their PBS equivalents:

command    description                              PBS equivalent
sbatch     submit a batch script                    qsub
srun       run a command via SLURM interactively    qsub -I
squeue     list the jobs in the queue               qstat
scontrol   modify your job in some way              qalter
scancel    kill your queued or running job          qdel

Running jobs on the Opteron cluster

There are currently two distinct ways to run jobs on the cluster: "locally" or "via NFS". Running locally means that all I/O is performed on a filespace local to the allocated node(s). In this mode, the executable itself need not be local (it can be executed from a home directory via NFS). Running via NFS is only allowed if your job specifically requires a common filespace to work correctly and does not do excessive amounts of I/O. Serial jobs should ALWAYS be run locally.

Example 1: running a serial job locally:

#!/bin/csh
#SBATCH -J TESTJOB
#SBATCH -n 1
#SBATCH -t 30
#SBATCH --mail-type=END

prepdir
cd $JOBDIR

./a.out > OUTFILE

Here is a summary of what this script does:

  1. #!/bin/csh: selects a "csh" environment. NOTE: it is best to use the same type of shell that you use on the front end.
  2. -J TESTJOB: chooses the job name
  3. -n 1: chooses the number of cores ("processors"), here 1
  4. -t 30: maximum wallclock time in minutes
  5. --mail-type=END: sends email when the job is finished
  6. prepdir: a script that prepares a job directory on the node and transfers files there (see below)
  7. cd $JOBDIR: changes into the prepared job directory
  8. ./a.out > OUTFILE: runs the job
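Because -t is given in minutes while the wallclock limit is stated in days, a quick conversion can help when sizing a request. The following is a minimal POSIX sh sketch for use on the front end (the batch scripts themselves are csh); the helper name to_minutes is hypothetical.

```shell
#!/bin/sh
# Convert a run time given in days and hours to the minutes expected by -t.
# to_minutes is an illustrative helper, not part of SLURM or the cluster tools.
to_minutes() {
    # $1 = days, $2 = hours
    echo $(( $1 * 24 * 60 + $2 * 60 ))
}

max_minutes=$(to_minutes 5 0)   # the 5-day wallclock limit
echo "$max_minutes"
```

Under this arithmetic, the 5-day limit corresponds to 7200 minutes, so "#SBATCH -t 7200" would be the longest request the limit allows.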

 

PUTFILES and EXCFILES

Since there will always be a set of files to send to and receive from the local run directory, this process has been automated using the two files called PUTFILES and EXCFILES. PUTFILES lists the files that are to be sent to the node(s) before the job starts. EXCFILES lists the files that should NOT be returned after the job completes. The names in the files are treated as regular expression patterns, e.g.:
Directory contents: a.out, INPUT, file1, file2, file3, INFILE, testdata, data, dataset, infile, testdata2

pattern in PUT/EXCFILES    will match
file                       file1, file2, file3, infile
IN                         INPUT, INFILE
file1$                     file1
file$                      infile
^file                      file1, file2, file3
data                       testdata, data, dataset, testdata2
testdata                   testdata, testdata2
data$                      testdata, data
^data                      data, dataset
^data$                     data
2                          file2, testdata2
.                          <all files>

You can always test the PUTFILES list by running: ls | grep -f PUTFILES. This will show what will be matched at the start of the run. A similar test can be done with EXCFILES by first using touch to create the expected output files and then using ls | grep -vf EXCFILES to show what will be returned.
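The PUTFILES check can be tried in a throwaway directory before touching real data. The sketch below is written in POSIX sh and recreates the example directory contents listed above; the temporary directory, the subdirectory name run, and the two patterns chosen are illustrative assumptions.

```shell
#!/bin/sh
# Recreate the example directory in a scratch location (names are stand-ins).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir run
touch run/a.out run/INPUT run/file1 run/file2 run/file3 run/INFILE \
      run/testdata run/data run/dataset run/infile run/testdata2

# One regular expression per line, as in a real PUTFILES.
printf '%s\n' '^data' 'file1$' > PUTFILES

# Same test as "ls | grep -f PUTFILES", run against the example directory.
matched=$(ls run | grep -f PUTFILES)
echo "$matched"

# Clean up the scratch location.
cd / && rm -rf "$workdir"
```

With these two patterns, only data, dataset and file1 are selected, which agrees with the ^data and file1$ rows of the pattern list.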

Here are some important things to remember:

For the batch job above, PUTFILES and EXCFILES both contain a.out. This sends over the program (a.out) but brings back only OUTFILE. Since the job is named TESTJOB, the files could also have been named PUTFILES.TESTJOB and EXCFILES.TESTJOB.
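The effect of these two pattern files on TESTJOB can be simulated with empty stand-in files created by touch. This is a minimal POSIX sh sketch; the temporary directory and the rundir name are assumptions made for illustration, and the real run directory would also hold the slurm-<JOBID>.* files.

```shell
#!/bin/sh
# Simulate the TESTJOB run directory with empty stand-in files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir rundir
touch rundir/a.out rundir/OUTFILE

# Both pattern files contain the single pattern "a.out".
echo 'a.out' > PUTFILES.TESTJOB
echo 'a.out' > EXCFILES.TESTJOB

sent=$(ls rundir | grep -f PUTFILES.TESTJOB)       # files matching PUTFILES
returned=$(ls rundir | grep -vf EXCFILES.TESTJOB)  # files NOT matching EXCFILES
echo "sent: $sent"
echo "returned: $returned"

# Clean up the scratch location.
cd / && rm -rf "$workdir"
```

Note that the "." in a.out is a regular expression wildcard, so the pattern would also match a name such as a_out; writing a\.out is the stricter form if that matters for your file names.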

Now we submit the job. Users can submit jobs from their home directories or from the global scratch spaces; however, files are always returned to the global scratch space (see below). This run is submitted from the home directory.

<23~/> pwd
/home/ewalter/abrun
<24~/> sbatch run
sbatch: Submitted batch job 11856
<25~/> squeue -u ewalter
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  11856     batch  TESTJOB  ewalter   R       1:10      1 c1-1
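When submission is scripted, the job ID is usually needed for later squeue or scancel calls. The sketch below, in POSIX sh, extracts it from a line in the format shown in the session above; the function name job_id_from_sbatch is hypothetical, and the sample line stands in for real sbatch output.

```shell
#!/bin/sh
# Pull the job ID out of a line such as "sbatch: Submitted batch job 11856".
# job_id_from_sbatch is an illustrative helper, not a SLURM command.
job_id_from_sbatch() {
    awk '/Submitted batch job/ { print $NF }'
}

# Stand-in for real sbatch output, copied from the session above.
jobid=$(echo 'sbatch: Submitted batch job 11856' | job_id_from_sbatch)
echo "$jobid"
```

On the cluster this could be used as jobid=$(sbatch run 2>&1 | job_id_from_sbatch), followed by squeue -j $jobid or scancel $jobid. The 2>&1 is an assumption that covers the case where sbatch prints its message on stderr.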

 

Accessing the job directory while the job is running

The remote job directory is simply /lscr/<USER>/<JOBID>.run. To access it, ssh to the node the job is running on and cd to this directory.
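The path can be assembled from the user name and job ID rather than typed by hand. This is a small POSIX sh sketch; remote_jobdir is a hypothetical helper name, and the user and job ID are taken from the example session.

```shell
#!/bin/sh
# Build the node-local job directory path /lscr/<USER>/<JOBID>.run.
# remote_jobdir is an illustrative helper, not part of the cluster tools.
remote_jobdir() {
    # $1 = user name, $2 = SLURM job ID
    printf '/lscr/%s/%s.run\n' "$1" "$2"
}

remote_jobdir ewalter 11856
```

Combined with the node name from the NODELIST column of squeue (c1-1 in the example), a quick look at a running job might be: ssh c1-1 ls $(remote_jobdir ewalter 11856).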

During and after the run, four files are generated to help you keep track of the job:

Filename               Location                      When generated     Description
slurm-<JOBID>.out      submission directory          when job starts    This is the default name of the stdout and stderr for the job.
slurm-<JOBID>.info     submission directory          when job starts    Lists the node(s) running the job, and other info.
slurm-<JOBID>.files    submission & job directory    when job ends      Lists where the files will be put on the front end.
slurm-<JOBID>.hosts    job directory                 when job ends      List of nodes used during run.

 

Once the job is finished, the run directory is sent back to one of two places: either the top level of your global scratch directory or the original submission directory. To be clear, the job directory is returned to the submission directory unless the job was submitted from your home directory, in which case it is sent to the top level of the global scratch space (for now it defaults to /scr1).

submission path                                                                  job return path
home directory: /home/<USER>/*                                                   top of user scr1 directory
user's global scratch directory: /home/<USER>/scr1/* or /home/<USER>/scr2/*      returned to submission path
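The return rule in the table can be written as a short path test. The following POSIX sh sketch prints the directory that would receive <JOBID>.run; the function name return_parent is hypothetical, and treating the top of scr1 or scr2 itself as scratch is an assumption not stated in the table.

```shell
#!/bin/sh
# Print the directory that will receive <JOBID>.run when the job ends.
# return_parent is an illustrative helper, not part of the cluster tools.
# $1 = user name, $2 = absolute submission directory
return_parent() {
    case "$2" in
        /home/"$1"/scr1|/home/"$1"/scr1/*|/home/"$1"/scr2|/home/"$1"/scr2/*)
            # Submitted from global scratch: files come back where they started.
            printf '%s\n' "$2" ;;
        *)
            # Submitted from elsewhere in the home directory: top of scr1.
            printf '/home/%s/scr1\n' "$1" ;;
    esac
}

return_parent ewalter /home/ewalter/abrun
return_parent ewalter /home/ewalter/scr2/runs
```

The first call corresponds to the example submission from /home/ewalter/abrun and prints /home/ewalter/scr1; the second prints the scratch submission path unchanged.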

Since the job was submitted from my home directory, the job directory is returned to the top of my global scratch space in <JOBID>.run:

~/scr1/11857.run>ls -l
total 24
-rwxrwxr-x 1 ewalter ewalter 10573 Jan 31 14:25 a.out
-rw-rw-r-- 1 ewalter ewalter     6 Jan 31 14:26 OUTFILE
-rw-rw-r-- 1 ewalter ewalter   390 Jan 31 14:25 slurm-11857.files
-rw-rw-r-- 1 ewalter ewalter    12 Jan 31 14:25 slurm-11857.hosts

 

Example 2: running a parallel job locally:

#!/bin/csh
#SBATCH -J parallel
#SBATCH -n 12
#SBATCH -t 30
#SBATCH --mail-type=END

prepdir
cd $JOBDIR

# Leave exactly one launch line uncommented, matching the MPI module you have loaded.

# If using openMPI (the default):
mpirun -np 12 ./a.out > OUTFILE

# If using mvapich2:
#srun -n 12 ./a.out > OUTFILE

# If using mvapich1:
#srun -n 12 --mpi=mvapich ./a.out > OUTFILE

Besides the number of cores (now 12), the only significant change compared to a serial run is how the executable is started. As shown on the software page, there are three separate Infiniband-enabled MPI environments: openmpi, mvapich, and mvapich2. Users have the openmpi-1.2.5 module loaded by default. Openmpi uses mpirun to start jobs, whereas both flavors of mvapich are started with srun.
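The three launch styles can be summarized as a small lookup. This POSIX sh sketch only prints the launch line used in Example 2 for a given MPI flavor; the function name launch_line and the flavor keywords are hypothetical labels, not module names.

```shell
#!/bin/sh
# Print the launch line for a given MPI flavor, following Example 2.
# launch_line is an illustrative helper; it prints a command, it does not run it.
# $1 = flavor (openmpi | mvapich2 | mvapich), $2 = number of cores
launch_line() {
    case "$1" in
        openmpi)  echo "mpirun -np $2 ./a.out" ;;
        mvapich2) echo "srun -n $2 ./a.out" ;;
        mvapich)  echo "srun -n $2 --mpi=mvapich ./a.out" ;;
        *)        echo "unknown MPI flavor: $1" >&2; return 1 ;;
    esac
}

launch_line openmpi 12
```

The core count passed here should match the value given to #SBATCH -n in the batch script.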

The master node is indicated by the SLURMD_NODENAME variable in the .info file. The progress of the job can be checked by logging into this node via ssh and looking in the /lscr/<USER>/<JOBID>.run directory.

 

Example 3: running a parallel job in the global scratch space:

#!/bin/csh
#SBATCH -J parallel
#SBATCH -n 3
#SBATCH -t 30
#SBATCH --mail-type=END

mpirun -np 3 ./a.out > OUTFILE  

Simply by removing the prepdir command, SLURM knows not to bother with sending and retrieving files. Users MAY NOT run this way from their home directory; it is allowed only from a directory in the global scratch space (/home/<USER>/scr1/* or /home/<USER>/scr2/*). Also, users may not run a code that requires excessive I/O or significantly burdens the NFS network.



©2009 The College of William and Mary