The Opteron and Itanium clusters use the Simple Linux Utility for Resource Management (SLURM) to control all jobs that run on the compute nodes. The maximum wallclock run time is 5 days. All jobs are scheduled using Maui and its fairshare algorithm (which attempts to give all users an equal share of the nodes over time). By default, users are given one core/processor per node, although an entire node (all four cores) can be requested via the batch script.
| command | description | PBS equivalent |
|---|---|---|
| sbatch | submit a batch script | qsub |
| srun | run a command via SLURM interactively | qsub -I |
| squeue | list the jobs in the queue | qstat |
| scontrol | modify your job in some way | qalter |
| scancel | kill your queued or running job | qdel |
There are currently two distinct ways to run jobs on the cluster: "locally" or "via NFS". Running locally means that all I/O is performed on a filespace local to the allocated node(s). In this mode, the executable itself need not be local (it can be executed from a home directory via NFS). Running via NFS is only allowed if your job specifically requires a common filespace to work correctly and does not do excessive amounts of I/O. Serial jobs should ALWAYS be run locally.

    #!/bin/csh
    #SBATCH -J TESTJOB
    #SBATCH -n 1
    #SBATCH -t 30
    #SBATCH --mail-type=END
    prepdir
    cd $JOBDIR
    ./a.out > OUTFILE

Here is a summary of what this script does:
Since there will always be a set of files to send to and receive from the local run directory, this process has been automated using the two files called PUTFILES and EXCFILES. PUTFILES lists the files that are to be sent to the node(s) before the job starts. EXCFILES lists the files that should NOT be returned after the job completes. The names in the files are treated as regular expression patterns, e.g.:
| directory contents | pattern in PUT/EXCFILES | will match |
|---|---|---|
| a.out, INPUT, file1, file2, file3, INFILE, testdata, data, dataset, infile, testdata2 | file | file1, file2, file3, infile |
| | IN | INPUT, INFILE |
| | file1$ | file1 |
| | file$ | infile |
| | ^file | file1, file2, file3 |
| | data | testdata, data, dataset, testdata2 |
| | testdata | testdata, testdata2 |
| | data$ | testdata, data |
| | ^data | data, dataset |
| | ^data$ | data |
| | 2 | file2, testdata2 |
| | . | <all files> |
You can always test the PUTFILES list by running: ls | grep -f PUTFILES. This shows what will be matched at the start of the run. A similar test can be done with EXCFILES by first using touch to create the expected output files and then using ls | grep -vf EXCFILES to show what will be returned.
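The check described above can be tried safely in a throwaway directory. The following is a minimal sketch, assuming a POSIX shell with `mktemp` and `grep` available; the two patterns written to the pattern file are chosen only for illustration, and the file names are the ones from the table.

```shell
#!/bin/sh
# Scratch area: run/ stands in for the job directory, and the pattern file
# is kept outside it so that it does not show up in the listing itself.
workdir=$(mktemp -d)
mkdir "$workdir/run"
patterns="$workdir/PUTFILES"

# Empty files with the names used in the table above.
(cd "$workdir/run" && touch a.out INPUT file1 file2 file3 INFILE \
    testdata data dataset infile testdata2)

# Illustrative patterns, one regular expression per line.
printf '%s\n' '^file' 'data$' > "$patterns"

# Same test as in the text: list the directory and keep the matching names.
matched=$(cd "$workdir/run" && LC_ALL=C ls | grep -f "$patterns")
echo "$matched"
```

Because each line is a regular expression and not a shell wildcard, `data$` selects both data and testdata, while `^file` leaves out infile.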
Here are some important things to remember:
For the batch job above, both PUTFILES and EXCFILES contain the single line a.out. This sends over the program (a.out) but only brings back OUTFILE, since a.out is excluded from the return. Because the job is named TESTJOB, the files could also have been named PUTFILES.TESTJOB and EXCFILES.TESTJOB.
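As a rough illustration of this setup, the sketch below writes the two pattern files and then simulates a finished run in a temporary directory to preview what would be sent and what would be returned. It only uses `ls` and `grep` as in the test described earlier; the temporary directory layout is an assumption made for the demonstration and is not how the cluster itself stages files.

```shell
#!/bin/sh
workdir=$(mktemp -d)
mkdir "$workdir/run"

# Both lists contain the single pattern a.out.
printf '%s\n' 'a.out' > "$workdir/PUTFILES"
printf '%s\n' 'a.out' > "$workdir/EXCFILES"

# Pretend the job has run: the executable plus its output file exist.
touch "$workdir/run/a.out" "$workdir/run/OUTFILE"

# Names matched by PUTFILES would be sent to the node(s).
sent=$(cd "$workdir/run" && LC_ALL=C ls | grep -f "$workdir/PUTFILES")
# Names NOT matched by EXCFILES would be returned after the job.
returned=$(cd "$workdir/run" && LC_ALL=C ls | grep -vf "$workdir/EXCFILES")

echo "sent: $sent"
echo "returned: $returned"
```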
Now we submit the job. Users can submit jobs from their home directories or from the global scratch spaces; however, files are always returned to the global scratch space (see below). This run is submitted from the home directory.

    <23~/> pwd
    /home/ewalter/abrun
    <24~/> sbatch run
    sbatch: Submitted batch job 11856
    <25~/> squeue -u ewalter
    JOBID PARTITION NAME    USER    ST TIME NODES NODELIST(REASON)
    11856 batch     TESTJOB ewalter R  1:10 1     c1-1

The remote job directory is simply /lscr/<USER>/<JOBID>.run. To access it, ssh to the node the job is running on and cd to this directory.
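For convenience, the path can be built from the user name and job ID. This is a small sketch; `job_dir` is a hypothetical helper name, and the user, job ID, and node in the usage comment are taken from the example session above.

```shell
#!/bin/sh
# job_dir USER JOBID: print the node-local run directory for a job.
job_dir() {
    printf '/lscr/%s/%s.run\n' "$1" "$2"
}

job_dir ewalter 11856

# Possible use while the job is running on node c1-1:
#   ssh c1-1 ls "$(job_dir ewalter 11856)"
```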
During and after the run, four files are generated to help you keep track of the job:
| Filename | Location | When generated | Description |
|---|---|---|---|
| slurm-<JOBID>.out | submission directory | when job starts | This is the default name of the stdout and stderr for the job. |
| slurm-<JOBID>.info | submission directory | when job starts | Lists the node(s) running the job, and other info. |
| slurm-<JOBID>.files | submission & job directory | when job ends | Lists where the files will be put on the front end. |
| slurm-<JOBID>.hosts | job directory | when job ends | List of nodes used during run |
Once the job is finished, the run directory is sent back to one of two places: either the top level of your global scratch directory or the original submission directory. To be clear, the job directory is returned to the submission directory unless the job was submitted from your home directory, in which case it is sent to the top level of the global scratch space (for now it defaults to /scr1).
| submission path | job return path |
|---|---|
| home directory /home/<USER>/* | top of user scr1 directory |
| user's global scratch directory /home/<USER>/scr1/* or /home/<USER>/scr2/* | returned to submission path |
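The rule in the table can be sketched as a small shell function. `return_dir` is a hypothetical name used only for illustration, and the sketch assumes the scratch spaces live at /home/<USER>/scr1 and /home/<USER>/scr2 as written in the table; the real staging scripts may decide this differently.

```shell
#!/bin/sh
# return_dir USER SUBMIT_DIR: print the directory that receives the job
# directory after the run, following the table above.
return_dir() {
    user=$1
    submit_dir=$2
    case "$submit_dir" in
        # Submitted from global scratch: returned to the submission path.
        "/home/$user/scr1"|"/home/$user/scr1/"*|"/home/$user/scr2"|"/home/$user/scr2/"*)
            printf '%s\n' "$submit_dir" ;;
        # Submitted from elsewhere in the home directory: top of scr1.
        "/home/$user"|"/home/$user/"*)
            printf '%s\n' "/home/$user/scr1" ;;
        # Anything else: returned to the submission directory.
        *)
            printf '%s\n' "$submit_dir" ;;
    esac
}

return_dir ewalter /home/ewalter/abrun
return_dir ewalter /home/ewalter/scr2/project
```

The scratch patterns are tested first because the scratch directories themselves sit under the home directory.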
Since the job was submitted from my home directory, the job directory is returned to the top of my global scratch space in <JOBID>.run.

    #!/bin/csh
    #SBATCH -J parallel
    #SBATCH -n 12
    #SBATCH -t 30
    #SBATCH --mail-type=END
    prepdir
    cd $JOBDIR
    #If using openMPI (the default):
    #mpirun -np 12 ./a.out > OUTFILE
    #If using mvapich2:
    srun -n 12 ./a.out > OUTFILE
    #If using mvapich1:
    srun -n 12 --mpi=mvapich ./a.out > OUTFILE

Besides the number of cores (now 12), the only significant change compared to a serial run is how the executable is started. As shown on the software page, there are three separate Infiniband-enabled MPI environments: openmpi, mvapich, and mvapich2. Users have the openmpi-1.2.5 module loaded by default. Openmpi uses mpirun to start jobs, whereas both flavors of mvapich use srun.
The master node is indicated by the SLURMD_NODENAME variable in the .info file. The progress of the job can be checked by logging into this node via ssh and looking in the /lscr/<USER>/<JOBID>.run directory.
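The node name can be pulled out of the .info file with a short filter. The exact layout of that file is not shown here, so the sketch below assumes a line of the form `SLURMD_NODENAME=<node>` (it also tolerates a colon or space as the separator); `master_node` is a hypothetical helper, and the sample file is fabricated purely to demonstrate it.

```shell
#!/bin/sh
# master_node INFO_FILE: print the value that follows SLURMD_NODENAME.
# Assumes the name and value share one line, separated by "=", ":" or a space.
master_node() {
    sed -n 's/^.*SLURMD_NODENAME[=: ]*\([^ ]*\).*$/\1/p' "$1" | head -n 1
}

# Made-up sample file for demonstration only; a real .info file may differ.
info_dir=$(mktemp -d)
printf 'SLURMD_NODENAME=c1-1\n' > "$info_dir/slurm-11856.info"

master_node "$info_dir/slurm-11856.info"
```

If the real file uses a different layout, only the `sed` expression should need adjusting.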

    #!/bin/csh
    #SBATCH -J parallel
    #SBATCH -n 3
    #SBATCH -t 30
    #SBATCH --mail-type=END
    mpirun -np 3 ./a.out > OUTFILE

Simply by removing the prepdir command, SLURM knows to not bother with sending and retrieving files. Users MAY NOT run this way from their home directory, only from a directory in the global scratch space /home/<USER>/scr1/* or /home/<USER>/scr2/*. Also, users may not run a code that requires excessive I/O and/or significantly burdens the NFS network.
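A simple guard run before sbatch can catch an NFS job being submitted from the wrong place. This is only a sketch: `in_global_scratch` is a hypothetical helper, the scratch locations are taken from the paths above, and treating the top of scr1 or scr2 itself as acceptable is an assumption.

```shell
#!/bin/sh
# in_global_scratch USER DIR: succeed if DIR lies in the user's global
# scratch space (/home/USER/scr1 or /home/USER/scr2), fail otherwise.
in_global_scratch() {
    user=$1
    dir=$2
    case "$dir" in
        "/home/$user/scr1"|"/home/$user/scr1/"*|"/home/$user/scr2"|"/home/$user/scr2/"*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Check the current directory before submitting a job without prepdir.
if in_global_scratch "$(id -un)" "$PWD"; then
    echo "OK: $PWD is in global scratch"
else
    echo "Not in global scratch: submit NFS jobs from scr1 or scr2 instead"
fi
```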