
SLURM Scheduler

SLURM is a scalable open-source scheduler used on a number of world-class clusters. In an effort to align CHPC with XSEDE and other national computing resources, CHPC has switched its clusters from the PBS scheduler to SLURM. There are several short training videos about Slurm and concepts such as batch scripts and interactive jobs.

About Slurm

Slurm (Simple Linux Utility for Resource Management) is used for managing job scheduling on clusters. It was originally created at the Livermore Computing Center and has grown into full-fledged open-source software backed by a large community, commercially supported by the original developers, and installed on many of the Top500 supercomputers.

Using Slurm

There is a hard limit of 72 hours for jobs on general cluster nodes and 14 days on owner cluster nodes.

You may submit jobs to the batch system in two ways: 

  • Submitting a script 
  • Submitting an interactive job

 

Submitting a script to Slurm:

The creation of a batch script

To create a batch script, use your favorite text editor to create a file that contains both instructions to SLURM and instructions on how to run your job. All instructions to SLURM are prefaced by #SBATCH. It is necessary to specify both the partition and the account in your jobs on all clusters EXCEPT tangent.

#SBATCH --account=<youraccount>

#SBATCH --partition=<yourpartition>

Accounts: Your account is usually your unix group name, typically your PI's last name. If your group has owner nodes, the account is usually <unix_group>-<cluster_abbreviation> (where the cluster abbreviation is kp, lp, notch, or ash). There is also the owner-guest account; all users have access to this account to run on owner nodes when they are idle. Jobs run as owner-guest are preemptable. Note that on the ash cluster, the owner-guest account is called smithp-guest.

Partitions: Partitions are cluster, cluster-freecycle, pi-cl, and cluster-guest, where cluster is the full name of the cluster and cl is the abbreviated form (kingspeak and kp, notchpeak and notch, ash and ash, lonepeak and lp, redwood and rw).

To view a list of accounts and partitions that are available to you, run the command myallocation.

Examples

In the examples below, we will suppose your PI is Frodo Baggins and has owner nodes on kingspeak (and not on notchpeak):

  • General user  example on lonepeak (no allocation required)
    #SBATCH --account=baggins
    #SBATCH --partition=lonepeak
  • General user on notchpeak with allocation (Frodo still has allocation available on notchpeak): 
    #SBATCH --account=baggins
    #SBATCH --partition=notchpeak
  • General users on notchpeak without allocation: (Frodo has run out of allocation)
    #SBATCH --account=baggins
    #SBATCH --partition=notchpeak-freecycle
  • To run on Frodo's owner nodes on kingspeak
    #SBATCH --account=baggins-kp
    #SBATCH --partition=baggins-kp
  • To run as owner-guest on notchpeak:
    #SBATCH --account=owner-guest
    #SBATCH --partition=notchpeak-guest
  • To run as owner-guest on ash:
    #SBATCH --account=smithp-guest
    #SBATCH --partition=ash-guest
  • To access notchpeak GPU nodes (need to request addition to account)
    #SBATCH --account=notchpeak-gpu
    #SBATCH --partition=notchpeak-gpu
  • To access kingspeak GPU nodes (need to request addition to account)
    #SBATCH --account=kingspeak-gpu
    #SBATCH --partition=kingspeak-gpu
NOTE: When specifying an account or partition, you may use either an equals sign or a space before the account or partition name, but not both. For example, "#SBATCH --account=kingspeak-gpu" and "#SBATCH --account kingspeak-gpu" are acceptable, but "#SBATCH --account = kingspeak-gpu" is not.
 

For more examples of SLURM jobs scripts see CHPC MyJobs templates.

IMPORTANT: The biggest change in moving from Torque/Moab to Slurm comes when you are out of allocation. At that point you no longer have access to the cluster partition and will have to manually change your scripts to use the cluster-freecycle partition.

 

Constraints: Constraints, specified with the line #SBATCH -C, can be used to target specific nodes. Constraints are used in conjunction with node features, an extension that allows for finer-grained specification of resources.

Features that we have specified are:

  • Core count on node: Features are requested with the --constraint or -C flag (described in the table below), and the core count is denoted as c#, e.g., c8, c12, c16, c20, c24, c28, c32. Features can be combined with logical operators, such as | for or and & for and. For example, to request 16- or 20-core nodes, use #SBATCH -C "c16|c20".
  • Amount of memory per node: This constraint takes the form m#, where the number is the amount (in GB) of memory in the node, e.g., m32, m64, m96, m192. IMPORTANT: There is a difference between the memory constraint, #SBATCH -C m32, and the batch directive #SBATCH --mem=32000. With the memory batch directive you specify the number as it appears in the MEMORY entry of the si command, which is in MB. This makes the job eligible to run only on a node with at least this amount of memory, but it also restricts the job to using only this amount of memory, even if the node has more. With the constraint, the job will only run on a node with that feature, and the job will have access to all of the memory of the node.
  • Node owner: This is given as either chpc for the nodes available to all users (general nodes) or the group/center name for owner nodes. It can be used as a constraint to target specific group nodes that have low use when running as owner-guest, in order to reduce the chances of being preempted (see the sketch after this list). In the output of the si command given above, the column NODES(A/I/O/T) gives the number of nodes that are allocated/idle/offline/total, allowing for the identification of owner nodes that are not being utilized. For example, to target nodes owned by group "ucgd", use -A owner-guest -p kingspeak-guest -C "ucgd". Historical usage (the past 2 weeks) of different owner node groups can be found at CHPC's constraint suggestion page.
  • GPUs: For the GPU nodes, the specified features include the GPU line, e.g., geforce or tesla, and the GPU type, e.g., a100, 3090, or t4. There is additional information about specifying the GPUs requested for a job on CHPC's GPU and Accelerator page.
  • Processor architecture: This is currently only on notchpeak and redwood. It is useful for jobs where you want to restrict the processor architecture used by the job. Examples are bwk for Intel Broadwell, skl for Intel Skylake, csl for Intel Cascade Lake, icl for Intel Ice Lake, npl for AMD Naples, rom for AMD Rome, and mil for AMD Milan.
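
Putting these together, a guest job that targets a particular owner group and core count could use directives like the following minimal sketch (the group name ucgd, account, and partition reuse the examples above and are illustrative only):

#SBATCH --account=owner-guest
#SBATCH --partition=kingspeak-guest
# target ucgd-owned nodes that have 16 cores
#SBATCH -C "ucgd&c16"
# a memory feature such as m192 could be used in the constraint instead of,
# or in addition to, the core count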

Below is a portion of the output of the si command (see the useful aliases section below for information on this command) run on notchpeak, which provides a list of the features for each group of nodes:

PARTITION               NODES  NODES(A/I/O/T)  S:C:T   MEMORY  TMP_DISK  TIMELIMIT   AVAIL_FEATURES                              NODELIST
notchpeak-gpu           3      2/1/0/3         2:16:2  188000  1800000   3-00:00:00  chpc,tesla,v100,skl,c32,m192                notch[001-003]
notchpeak-gpu           1      1/0/0/1         2:16:2  188000  1800000   3-00:00:00  chpc,geforce,2080ti,tesla,p40,skl,c32,m192  notch004
notchpeak-gpu           4      4/0/0/4         2:20:2  188000  1780000+  3-00:00:00  chpc,geforce,2080ti,csl,c40,m192            notch[086-088,271]
notchpeak-gpu           1      1/0/0/1         8:6:2   252000  1760000   3-00:00:00  chpc,geforce,3090,mil,c48,m256              notch328
notchpeak-gpu           1      0/1/0/1         2:32:2  508000  1690000   3-00:00:00  chpc,geforce,3090,a100,rom,c64,m512         notch293
notchpeak-shared-short  2      1/1/0/2         2:26:2  380000  1800000   8:00:00     chpc,t4,csl,c52,m384                        notch[308-309]
notchpeak-shared-short  2      0/2/0/2         2:32:2  508000  1700000   8:00:00     chpc,tesla,k80,npl,c64,m512                 notch[081-082]
notchpeak*              4      4/0/0/4         2:16:2  92000   1800000   3-00:00:00  chpc,skl,c32,m96                            notch[005-008]
notchpeak*              19     13/6/0/19       2:16:2  188000  1800000   3-00:00:00  chpc,skl,c32,m192                           notch[009-018,035-043]
notchpeak*              7      4/3/0/7         2:20:2  188000  1800000   3-00:00:00  chpc,csl,c40,m192                           notch[096-097,106-107,153-155]
notchpeak*              32     14/18/0/32      4:16:2  252000  3700000   3-00:00:00  chpc,rom,c64,m256                           notch[172-203]
notchpeak*              2      1/1/0/2         2:16:2  764000  1700000   3-00:00:00  chpc,skl,c32,m768                           notch[044-045]
notchpeak*              1      0/1/0/1         2:18:2  764000  7400000   3-00:00:00  chpc,skl,c36,m768                           notch068

 

Reservations: Upon request we can create reservations for users to guarantee node availability; send an email to helpdesk@chpc.utah.edu. Reservations are requested with the --reservation flag (abbreviated as -R) followed by the reservation name, which consists of a user name followed by a number, e.g., u0123456_1. Thus, to use an existing reservation in a job script, include #SBATCH --reservation=u0123456_1.

QOS: QOS stands for Quality of Service. While this is not normally specified, it is necessary in a few cases. One example is when a user needs to override the normal 3 day wall time limit. In this case, the user can request access to a special long QOS that we have set up for the general nodes of a cluster, cluster-long, which allows a longer wall time to be specified. In order to get access to the long QOS of a given cluster, send a request with an explanation of why you need a longer wall time to helpdesk@chpc.utah.edu.
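
For example, once access has been granted, a job on the notchpeak general nodes could request the long QOS together with a longer wall time. This is a sketch only; the QOS name below simply follows the cluster-long pattern described above:

#SBATCH --account=baggins
#SBATCH --partition=notchpeak
#SBATCH --qos=notchpeak-long   # long QOS, available on request
#SBATCH --time=7-00:00:00      # wall time beyond the usual 3 day limit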

For policies regarding reservations see the Batch Policies document.

Sample MPI job Slurm script

#!/bin/csh
#SBATCH --time=1:00:00 # walltime, abbreviated by -t
#SBATCH --nodes=2      # number of cluster nodes, abbreviated by -N
#SBATCH -o slurm-%j.out-%N # name of the stdout, using the job number (%j) and the first node (%N)
#SBATCH -e slurm-%j.err-%N # name of the stderr, using job and first node values
#SBATCH --ntasks=16    # number of MPI tasks, abbreviated by -n
# additional information for allocated clusters
#SBATCH --account=baggins    # account - abbreviated by -A
#SBATCH --partition=lonepeak # partition, abbreviated by -p
#
# set data and working directories
setenv WORKDIR $HOME/mydata
setenv SCRDIR /scratch/general/vast/$USER/$SLURM_JOBID
mkdir -p $SCRDIR
cp -r $WORKDIR/* $SCRDIR
cd $SCRDIR
#
# load appropriate modules, in this case Intel compilers, MPICH2
module load intel mpich2
# for MPICH2 over Ethernet, set communication method to TCP
# see below for network interface selection options for different MPI distributions
setenv MPICH_NEMESIS_NETMOD tcp
# run the program
# see below for ways to do this for different MPI distributions
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out

The #SBATCH lines denote the SLURM flags; the rest of the script is instructions on how to run your job. Note that we use the SLURM built-in $SLURM_NTASKS variable to set the number of MPI tasks to run. For a plain MPI job, this number should equal the number of nodes ($SLURM_NNODES) times the number of cores per node.
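
For instance, on nodes with 8 cores each (an illustrative number that matches the sample script above), a plain MPI job would request:

#SBATCH --nodes=2
#SBATCH --ntasks=16   # 2 nodes x 8 cores per node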

Also note that some packages have not been built with the MPI distributions that support Slurm, in which case you will need to specify the hosts to run on via a machinefile flag to the mpirun command of the appropriate MPI distribution. Please see the package help page for details and the appropriate script. Additional information on creating a machinefile is also given in the table below discussing SLURM environment variables.
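
As a sketch of that approach, a machinefile can be built from the Slurm allocation with srun (the same command shown in the environment variables table below) and handed to mpirun; note that the exact flag name (-machinefile, -hostfile, ...) depends on the MPI distribution:

# generate a machinefile with one line per allocated core
srun hostname | sort > nodefile.$SLURM_JOBID
# pass it to an mpirun that is not Slurm-aware
mpirun -np $SLURM_NTASKS -machinefile nodefile.$SLURM_JOBID my_mpi_program > my_program.out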

For mixed MPI/OpenMP runs, you can either hard code OMP_NUM_THREADS in the script or use logic like that below to derive it from the Slurm job information. When requesting resources, ask for the number of MPI tasks and the number of nodes to run on, not for the total number of cores the MPI+OpenMP tasks will use.

#SBATCH -N 2
#SBATCH -n 4
#SBATCH -C "c12" # we want to run on uniform core count nodes

# find number of threads for OpenMP
# find number of MPI tasks per node
set TPN=`echo $SLURM_TASKS_PER_NODE | cut -f 1 -d \(`
# find number of CPU cores per node
set PPN=`echo $SLURM_JOB_CPUS_PER_NODE | cut -f 1 -d \(`
@ THREADS = ( $PPN / $TPN )
setenv OMP_NUM_THREADS $THREADS
# set thread affinity to CPU socket
setenv KMP_AFFINITY verbose,granularity=core,compact,1,0

mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out

Alternatively, the SLURM option -c, or --cpus-per-task, can be used, like:

#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 6

setenv OMP_NUM_THREADS $SLURM_CPUS_PER_TASK
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out

Note that if you use this on a cluster with nodes that have varying core counts, SLURM is free to pick any node, so the job nodes may be undersubscribed (e.g., on ash, the above option would fully subscribe 12-core nodes but undersubscribe the 20- or 24-core nodes).

NOTE specific to MPI jobs: We have received reports that when running MPI jobs under certain circumstances (specifically when the job does not have any initial setup and therefore starts with the mpirun step), a race condition can occur where the job tries to start before the worker nodes are ready, resulting in this error:

Access denied by pam_slurm_adopt: you have no active jobs on this node
Authentication failed.

 

In this case, the solution is to add a sleep before the mpirun:

sleep 30
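
In context, the launch portion of such a job script would look like this minimal sketch:

# give the worker nodes a moment to finish job setup before launching MPI
sleep 30
mpirun -np $SLURM_NTASKS my_mpi_program > my_program.out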

More info on pam_slurm_adopt: This issue stems from a tool CHPC uses on all clusters, pam_slurm_adopt, which captures processes started via ssh that would otherwise land outside the job's cgroup and adopts them into the right cgroup. To do this, pam_slurm_adopt on the remote node has to talk back to the node from which the mpirun/ssh call was made, find out which job the remote call came from, check whether that job is also allocated on the new node, and then adopt the process into the cgroup. srun, on the other hand, goes through the usual Slurm paths, which do not require the same back-and-forth callbacks, since it spawns the remote process directly in the cgroup.

New July 2020 - NOTE specific to the use of /scratch/local: Users can no longer create directories in the top level /scratch/local directory. Instead, as part of the slurm job prolog (before the job is started), a job-level directory, /scratch/local/$USER/$SLURM_JOB_ID, will be created. Only the job owner will have access to this directory. At the end of the job, in the slurm job epilog, this job-level directory will be removed. As an example, a script for a single node job making use of /scratch/local:

#!/bin/csh
#SBATCH --time=1:00:00     # walltime, abbreviated by -t
#SBATCH --nodes=1          # number of cluster nodes, abbreviated by -N
#SBATCH -o slurm-%j.out-%N # name of the stdout, using the job number (%j) and the first node (%N)
#SBATCH -e slurm-%j.err-%N # name of the stderr, using job and first node values
#SBATCH --account=baggins    # account - abbreviated by -A
#SBATCH --partition=lonepeak # partition, abbreviated by -p
#
# set data and working directories
setenv WORKDIR $HOME/mydata
setenv SCRDIR /scratch/local/$USER/$SLURM_JOB_ID
cp -r $WORKDIR/* $SCRDIR
cd $SCRDIR
#
# load appropriate modules
module load yourappmodule
# run the program
myprogram myinput > myoutput

# copy output back to WORKDIR
cp $SCRDIR/myoutput $WORKDIR/.

Note that if your script currently does mkdir -p /scratch/local/$USER/$SLURM_JOB_ID, it will still run properly with this change. Also note that, depending on your program, you may want to run it such that any output necessary for restarting your program is written to your home directory or group space instead of to the local scratch space.

Job Submission using SLURM

In order to submit a job, one first has to log in to an interactive node. The job submission is then done with the sbatch command.

For example, to submit a script named script.slurm just type:

  • sbatch script.slurm

IMPORTANT: sbatch by default passes all environment variables to the compute node, which differs from the behavior in PBS (which started with a clean shell). If you need to start with a clean environment, you will need to use the following directive in your batch script:

  • #SBATCH --export=NONE

This will still execute .bashrc/.tcshrc scripts, but any changes you make in your interactive environment will not be present in the compute session. As an additional precaution, if you are using modules, you should use  module purge to guarantee a fresh environment.
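
For example, a script header combining both precautions might look like this sketch (the account and partition names reuse the examples above):

#!/bin/csh
#SBATCH --export=NONE        # start the job with a clean environment
#SBATCH --account=baggins
#SBATCH --partition=notchpeak
#SBATCH --time=1:00:00

module purge                 # guarantee a fresh module environment
module load intel mpich2     # then load only the modules the job needs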

 Checking the status of your job

To check the status of your job, use the squeue command.

  • squeue

The most common arguments are -u u0123456 for listing only user u0123456's jobs, and -j job# for listing the job specified by the job number. Adding -l (for "long" output) gives more details.

Alternatively, from the accounting perspective, one can use the sacct command. This command accesses the accounting database and can give useful information about current and past job resource usage.
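
For example (the job number is a placeholder, and the sacct --format fields shown are just one common selection):

squeue -u u0123456 -l      # long listing of one user's jobs
squeue -j 1234567          # status of a single job
sacct -j 1234567 --format=JobID,JobName,Elapsed,MaxRSS,State   # accounting info for a job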

 Interactive batch jobs

In order to launch an interactive session on a compute node do:

salloc --time=1:00:00 --ntasks 2 --nodes=1 --account=chpc --partition=notchpeak 

The salloc flags can be abbreviated as:

salloc -t 1:00:00 -n 2 -N 1 -A chpc -p notchpeak

The above command by default passes all environment variables of the parent shell; therefore, the X window connection is preserved as well, allowing graphical applications such as GUI-based programs to run inside the interactive job.

CHPC cluster queues tend to be very busy, so it may take some time for an interactive job to start. For this reason, in March 2019 we added two nodes in a special partition on the notchpeak cluster that are geared more towards interactive work. Job limits on this partition are 8 hours of wall time and a maximum of ten submitted jobs per user, with a maximum of two running jobs and a maximum total of 32 tasks and 128 GB of memory. To access this special partition, called notchpeak-shared-short, request both an account and partition under this name, e.g.:

salloc -N 1 -n 2 -t 2:00:00 -A notchpeak-shared-short -p notchpeak-shared-short

Running MPI jobs

One option is to produce the hostfile and feed it directly to the mpirun command of the appropriate MPI distribution. The disadvantage of this approach is that it does not integrate with SLURM and as such it does not provide advanced features such as task affinity, accounting, etc.

Another option is to use process manager built into SLURM and launch the MPI executable through srun command. How to do this for various MPI distributions is described at http://slurm.schedmd.com/mpi_guide.html. Some MPI distributions' mpirun commands integrate with Slurm and thus it is more convenient to use them instead of srun.

For the MPI distributions at CHPC, the following works (assuming an MPI program internally threaded with OpenMP).

Intel MPI
module load [intel,gcc] impi
# for a cluster with Ethernet only, set network fabrics to TCP
setenv I_MPI_FABRICS shm:tcp
# for a cluster with InfiniBand, set network fabrics to OFA
setenv I_MPI_FABRICS shm:ofa
# on lonepeak owner nodes, use the TMI interface (InfiniPath)
setenv I_MPI_FABRICS shm:tmi

# IMPI option 1 - launch with PMI library - currently not using task affinity, use mpirun instead
setenv I_MPI_PMI_LIBRARY /uufs/CLUSTER.peaks/sys/pkg/slurm/std/lib/libpmi.so
#srun -n $SLURM_NTASKS $EXE >& run1.out
# IMPI option 2 - bootstrap
mpirun -bootstrap slurm -np $SLURM_NTASKS $EXE  >& run1.out
MPICH2

Launch the MPICH2 jobs with mpiexec as explained in http://slurm.schedmd.com/mpi_guide.html#mpich2. That is:

module load [intel,gcc,pgi] mpich2
setenv MPICH_NEMESIS_NETMOD mxm # default is Ethernet, choose mxm for InfiniBand
mpirun -np $SLURM_NTASKS $EXE
OpenMPI

Use the mpirun command from the OpenMPI distribution. There's no need to specify the hostfile as OpenMPI communicates with Slurm in that regard. To run:

module load [intel,gcc,pgi] openmpi
mpirun --mca btl tcp,self -np $SLURM_NTASKS $EXE # in case of Ethernet network cluster, such as general lonepeak nodes.
mpirun -np $SLURM_NTASKS $EXE # in case of InfiniBand network clusters

Note that OpenMPI supports multiple network interfaces and as such it allows for single MPI executable across all CHPC clusters, including the InfiniPath network on lonepeak.

MVAPICH2

MVAPICH2 executables can be launched with the mpirun command (preferably) or with srun, in which case one needs to use the --mpi=none flag. To run multi-threaded code, make sure to set OMP_NUM_THREADS and MV2_ENABLE_AFFINITY=0 (to ensure that the MPI tasks don't get locked to a single core) before calling srun.

module load [intel,gcc,pgi] mvapich2
setenv OMP_NUM_THREADS 6 # optional number of OpenMP threads
setenv MV2_ENABLE_AFFINITY 0 # disable process affinity - only for multi-threaded programs
mpirun -np $SLURM_NTASKS $EXE # mpirun is recommended
srun -n $SLURM_NTASKS --mpi=none $EXE # srun is optional

Running shared jobs and running multiple serial calculations within one job

On July 1, 2019 node sharing was enabled on all CHPC clusters in the general environment. This is the best option when a user has a single job that does not need an entire node. For more information see the NodeSharing page.

Please note that when a user has many calculations to submit at the same time, whether each needs a portion of a node or an entire node, there may be better options than submitting each calculation as a separate batch job; for these cases see our page dedicated to running multiple serial jobs, as well as the next section on job arrays.

Multiple jobs using job arrays

Job arrays enable quick submission of many jobs that differ from each other only by some sort of index. Slurm provides the environment variable SLURM_ARRAY_TASK_ID, which serves as the differentiator between the jobs. For example, if our program takes input data input.dat, we can run it with 30 different input files named input[1-30].dat using the following script, named myrun.slr:

#!/bin/tcsh
#SBATCH -J myprog # A single job name for the array
#SBATCH -n 1 # Number of tasks
#SBATCH -N 1 # All tasks on one machine
#SBATCH -p CLUSTER # Partition on some cluster
#SBATCH -A chpc # General CHPC account
#SBATCH -t 0-2:00 # 2 hours (D-HH:MM)
#SBATCH -o myprog%A%a.out # Standard output
#SBATCH -e myprog%A%a.err # Standard error

./myprogram input$SLURM_ARRAY_TASK_ID.dat 

We then use the --array  parameter to run this script:

sbatch --array=1-30 myrun.slr

Apart from SLURM_ARRAY_TASK_ID, which is an environment variable unique to each job in the array, notice also %A and %a, which represent the job ID and the job array index, respectively. These can be used in the sbatch parameters to generate unique names.

You can also limit the number of jobs that can be running simultaneously to "n" by adding a %n after the end of the array range:

sbatch --array=1-30%5 myrun.slr

When submitting applications that utilize less than the full CPU count per node, please make sure to use the shared partitions, to allow multiple array jobs on one node (see the sketch below). For more information see the NodeSharing page. Also see our document detailing various ways of running multiple serial jobs.
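
As a sketch, a single-core array task on the notchpeak general nodes could use directives such as the following; the shared partition and account names are illustrative and follow the patterns described above and on the NodeSharing page:

#SBATCH -A baggins
#SBATCH -p notchpeak-shared   # shared partition, so several array tasks can share a node
#SBATCH -n 1                  # one core per array task
#SBATCH --mem=4G              # request only the memory each task needs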

Automatic restarting of preemptable jobs

The owner-guest or freecycle queues tend to have quicker turnaround than the general queues. However, guest jobs may get preempted. If one's job is checkpointed (e.g., by saving particle positions and velocities in dynamics simulations, or property values and gradients in minimizations), one can automatically restart a preempted job using the following strategy:

  1. Right at the beginning of the job script, submit a new job with a dependency on the current job. This ensures that the new job becomes eligible to run only after the current job is preempted (or finished). Save the new job submission information into a file; this file contains the job ID of the new job, which we save to an environment variable NEWJOB.
    sbatch -d afterany:$SLURM_JOBID run_ash.slr >& newjob.txt
    set NEWJOB=`cat newjob.txt |cut -f 4 -d " "`
  2. In the simulation output, include a file that lists the last checkpointed iteration, time step, or other measure of the simulation progress. In our example below, we have a file called inv.append which, among other things, contains lines on simulation iterations, one per line.
  3. In the job script, extract the iteration number from this file and put it into the simulation input file (here called inpt.m). This input file will be used when the simulation is restarted. Since the simulation file does not exist at the very start of the simulation, the first job will not append to the input file and will thus begin from the start.
    set ITER=`cat $SCRDIR/$RUNNAME/work_inv/inv.append | grep Iter |tail -n 1 | cut -f 2 -d " " | cut -f 1 -d /`
    if ($ITER != "") then
    echo "restart=$ITER;" >> inpt.m
    endif
  4. Run the simulation. If the job gets preempted, the current job will end here. If it runs to completion, then at the end of the job script make sure to delete the new job, identified by the environment variable NEWJOB, that was submitted when this job started.
    scancel $NEWJOB

In summary, the whole SLURM script (called run_ash.slr) would look like this:

#SBATCH all necessary job settings (partition, walltime, nodes, tasks)
#SBATCH -A owner-guest

# submit a new job dependent on the finish of the current job
sbatch -d afterany:$SLURM_JOBID run_ash.slr >& newjob.txt
# get this new job job number
set NEWJOB=`cat newjob.txt |cut -f 4 -d " "`
# figure out from where to restart
set ITER=`cat $SCRDIR/$RUNNAME/work_inv/inv.append | grep Iter |tail -n 1 | cut -f 2 -d " " | cut -f 1 -d /`
if ($ITER != "") then
echo "restart=$ITER;" >> inpt.m
endif

# copy input files to scratch
# run simulation
# copy results out of the scratch

# delete the job if the simulation finished
scancel $NEWJOB

Handy Slurm Information

Slurm User Commands

Slurm Command   What it does
sinfo           reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options. For a personalized view, showing only information about the partitions to which you have access, see mysinfo.
squeue          reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order. For a personalized view, showing only information about the jobs in the queues/partitions to which you have access, see mysqueue.
sbatch          is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
scancel         is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
sacct           is used to report job or job step accounting information about active or completed jobs.
srun            is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
spart           lists partitions and their utilization.
pestat          lists efficiency of cluster utilization on a per node, user, or partition basis. By default it prints utilization of all cluster nodes; to select only nodes utilized by a user, run pestat -u $USER. This command is very useful in checking whether your jobs are running efficiently.

Useful Slurm aliases

Bash to add to .aliases file:
#SLURM Aliases that provide information in a useful manner for our clusters
alias si="sinfo -o \"%20P %5D %14F %8z %10m %10d %11l %32f %N\""
alias si2="sinfo -o \"%20P %5D %6t %8z %10m %10d %11l %32f %N\""
alias sq="squeue -o \"%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R\""

Tcsh to add to .aliases file:
#SLURM Aliases that provide information in a useful manner for our clusters
alias si 'sinfo -o "%20P %5D %14F %8z %10m %11l %32f %N"'
alias si2 'sinfo -o "%20P %5D %6t %8z %10m %10d %11l %32f %N"'
alias sq 'squeue -o "%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R"'

sview GUI tool

sview is a graphical user interface to view and modify the Slurm state. Run it by typing sview. It is useful for viewing partition and node characteristics and information on jobs. Right-clicking on a job, node, or partition allows you to perform actions on it; use this carefully so as not to accidentally modify or remove your job.


Moab/PBS to Slurm translation

Moab/PBS to Slurm commands

Action                        Moab/Torque         Slurm
Job submission                msub/qsub           sbatch
Job deletion                  canceljob/qdel      scancel
List all jobs in queue        showq/qstat         squeue
List all nodes                                    sinfo
Show information about nodes  mdiag -n/pbsnodes   scontrol show nodes
Job start time                showstart           squeue --start
Job information               checkjob            scontrol show job <jobid>
Reservation information       showres             scontrol show res (this option shows details)
                                                  sinfo -T

Moab/PBS to Slurm environment variables

Description                   Moab/Torque      Slurm
Job ID                        $PBS_JOBID       $SLURM_JOBID
Node list                     $PBS_NODEFILE    Generate a listing of 1 node per line:
                                               srun hostname | sort -u > nodefile.$SLURM_JOBID
                                               Generate a listing of 1 core per line:
                                               srun hostname | sort > nodefile.$SLURM_JOBID
Submit directory              $PBS_O_WORKDIR   $SLURM_SUBMIT_DIR
Number of nodes                                $SLURM_NNODES
Number of processors (tasks)                   $SLURM_NTASKS ($SLURM_NPROCS for backward compatibility)

Moab/PBS to Slurm job script modifiers

 

Description                    Moab/Torque                      Slurm
Walltime                       #PBS -l walltime=1:00:00         #SBATCH -t 1:00:00 (or --time=1:00:00)
Process count                  #PBS -l nodes=2:ppn=12           #SBATCH -n 24 (or --ntasks=24)
                                                                #SBATCH -N 2 (or --nodes=2)
                                                                For threaded MPI jobs, use the number of MPI tasks for --ntasks,
                                                                not the number of cores. See the example script above for how to
                                                                determine the number of threads per MPI task.
Memory                         #PBS -l nodes=2:ppn=12:m24576    #SBATCH --mem=24576
                                                                It is also possible to specify memory per task with --mem-per-cpu;
                                                                also see the constraint section above for additional information
                                                                on the use of this.
Mail options                   #PBS -m abe                      #SBATCH --mail-type=FAIL,BEGIN,END
                                                                There are other options such as REQUEUE, TIME_LIMIT_90, ...
Mail user                      #PBS -M user@mail.com            #SBATCH --mail-user=user@mail.com
Job name and STDOUT/STDERR     #PBS -N myjob                    #SBATCH -o myjob-%j.out-%N
                                                                #SBATCH -e myjob-%j.err-%N
                                                                NOTE: %j and %N are replaced by the job number and the node
                                                                (first node for a multi-node job), giving the stderr and stdout
                                                                a unique name for each job.
Account                        #PBS -A owner-guest              #SBATCH -A owner-guest (or --account=owner-guest)
                               (optional in Torque/Moab)        (required in Slurm)
Dependency                     #PBS -W depend=afterok:12345     #SBATCH -d afterok:12345 (or --dependency=afterok:12345)
                               (run after job 12345             Similarly to Moab, other modifiers include after, afterany, and
                               finishes correctly)              afternotok. Please note that if a job runs out of walltime, this
                                                                does not constitute an OK exit; to start a job after a specified
                                                                job finished in any way, use afterany. For details on job exit
                                                                codes see http://slurm.schedmd.com/job_exit_code.html
Reservation                    #PBS -l advres=u0123456_1        #SBATCH -R u0123456_1 (or --reservation=u0123456_1)
Partition                      No direct equivalent             #SBATCH -p lonepeak (or --partition=lonepeak)
Propagate all environment      #PBS -V                          All environment variables are propagated by default, except for
variables from terminal                                         modules, which are purged at job start to prevent possible
                                                                inconsistencies. One can either load the needed modules in the
                                                                job script, or have them in their .custom.[sh,csh] file.
Propagate specific             #PBS -v myvar                    #SBATCH --export=myvar
environment variable                                            Use with caution, as this will export ONLY the variable myvar.
Target specific owner          #PBS -l nodes=1:ppn=24:ucgd      #SBATCH -A owner-guest -p kingspeak-guest -C "ucgd"
nodes as guest                 -A owner-guest
Target specific nodes                                           #SBATCH -w notch001,notch002 (or --nodelist=notch001,notch002)

 Information about job priority

Note that this applies to the general resources, not to owner resources on the clusters.

The first and most significant portion of a job's priority is based on the account being used and whether it has allocation or not. Jobs run with allocation have a base priority of 100,000; jobs without have a base priority of 1.

To this, there are additional values added for:

(1) Age (time a job spends in the queue) -- For the age of a job we see a roughly linear growth of the priority until it hits a cap. The cap is a limit we put on how much extra priority a job can accrue from time spent in the queue.

(2) Fairshare (how much of the system you have used recently) -- Fairshare is a factor based on the historical usage of a user. All things being equal, the user who has used the system less recently gets a priority bonus over the user who has used it more recently. This value behaves more exponentially than the other two factors.

(3) JobSize (how many nodes/cores your job is requesting) -- Job size is again a linear value according to the number of resources requested. It is fixed at submit time according to the requested resources.

At any point you can run sprio to see the current priority, as well as the source of that priority (in terms of the three components mentioned above), for all idle jobs in the queue on a cluster.
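
For example (sshare, not discussed above, reports the historical usage that feeds into the fairshare factor):

sprio                 # priority components (age, fairshare, job size) for all pending jobs
sprio -u u0123456     # restrict the listing to one user's pending jobs
sshare -U             # show your own fairshare usage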

UPDATE - 21 January 2020: With the new version of slurm installed earlier this month, we have the ability to limit, on a per-QOS basis, the number of pending jobs per user that accrue priority based on the age factor. This limit has been set to 5. At the same time, we have set a limit of 1000 on the number of jobs a user can submit per QOS (this was already set for a few of the QOSes but is now set for each QOS).

How to determine which Slurm accounts you are in

In order to see which accounts and partitions you can use, do:

sacctmgr -p show assoc user=<UNID> 

The output of this command can be difficult to interpret; a wrapper that prints a formatted version follows:

printf "%-15s%-25s%s\n" "Cluster" "Account" "Partition" && sacctmgr -p show assoc user=$USER | awk -F'|' 'NR>1 { printf "%-15s%-25s%s\n", $1, $2, $18 }' | sort

If you find yourself using this command often, you can create an alias by escaping some of the parameters. For example, in bash, add the following to your ~/.aliases file, then source ~/.aliases:

alias myslurmaccts='printf "%-15s%-25s%s\n" "Cluster" "Account" "Partition" && sacctmgr -p show assoc user=$USER | awk -F"|" "NR>1 { printf \"%-15s%-25s%s\n\", \$1, \$2, \$18 }" | sort'

There are a few exceptions to the account-partition mapping that will not be correct using the above method.

Another option is to use the myallocation command that we have developed for this purpose. However, there is a chance that we may have missed some of these exceptions in its logic, so if you notice any irregularity in your account-partition mapping, please let us know.

How to log into the nodes where the job runs

Sometimes it is useful to connect to the nodes where a job runs, for example to monitor whether the executable is running correctly and efficiently. The best way is to ssh to the nodes where the job runs; we allow users with jobs on compute nodes to ssh to those compute nodes. First, find the nodes where the job runs with the squeue -u $USER command, and then ssh to these nodes.
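
For example (the node name below is illustrative):

squeue -u $USER    # note the NODELIST column of your running job
ssh notch001       # then ssh to one of the listed nodes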

Other good sources of information

Last Updated: 3/5/24