Scheduling Jobs at the CHPC with Slurm
Slurm is a scalable, open-source scheduler used by over 60% of the world's top clusters and supercomputers. There are several short training videos about Slurm, including concepts such as batch scripts and interactive jobs.
About Slurm
Slurm (Simple Linux Utility for Resource Management) is used for managing job scheduling on clusters. It was originally created at the Livermore Computing Center and has grown into full-fledged open-source software backed by a large community, commercially supported by the original developers, and installed on many of the Top500 supercomputers. The Slurm development team is based near the University of Utah, in Lehi, Utah.
You may submit jobs to the Slurm batch system in two ways: (1) by submitting a batch script, or (2) by starting an interactive job.
Why submit your job through a Slurm batch script?
Batch jobs are designed to run for many hours or days without user intervention, so you don't need to stay logged in or keep a web browser open. They are ideal for tasks that don't require real-time viewing, such as simulations, large data processing runs, or large compilations, which only produce output files.
Use a Slurm batch script
Why use an interactive session?
An interactive session gives you a real-time connection to a compute resource, often with a graphical desktop or a live terminal. It is essential for testing small pieces of code, debugging scripts, or checking inputs and outputs in real time, and you get immediate feedback on commands.
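A common way to start a simple terminal-based interactive session from a login node is an srun command along these lines (a minimal sketch; the partition, account, and time values below are illustrative placeholders):

srun --nodes=1 --ntasks=1 --time=1:00:00 --partition=<partition> --account=<account> --pty /bin/bash -l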
Use an interactive session
Special Circumstances
Slurm Reservations
Upon request (via email to helpdesk@chpc.utah.edu), we can create reservations for users to guarantee node availability. Once a reservation is in place, it can be passed to Slurm with the --reservation flag (abbreviated as -R) followed by the reservation name.
For policies regarding reservations see the Batch Policies document.
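As a minimal sketch (the reservation name below is a placeholder; use the name CHPC gives you), the flag can be supplied on the command line or inside the batch script:

sbatch --reservation=<reservation_name> myscript.slurm
# or, equivalently, as a directive in the script itself:
#SBATCH --reservation=<reservation_name>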
Quality of Service (QOS)
Every account (which you can look up with the 'mychpc batch' command) is associated with at least one QOS, or Quality of Service. The QOS dictates a job's base priority. In some cases, multiple QOSes may be associated with a single account, differing in preemption status and maximum job wall time.
One example of multiple QOSes on a single account is when a user needs to exceed the normal 3-day wall time limit. In this case, the user can request access to a special long QOS that we have set up for the general nodes of a cluster, <cluster>-long, which allows a longer wall time to be specified. To get access to the long QOS of a given cluster, send a request explaining why you need a longer wall time to helpdesk@chpc.utah.edu.
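Once access is granted, the long QOS can be requested in a batch script with directives along these lines (a sketch only; the cluster, account, and wall time values are placeholders):

#SBATCH --partition=<cluster>
#SBATCH --account=<account>
#SBATCH --qos=<cluster>-long
#SBATCH --time=7-00:00:00      # e.g. 7 days, beyond the normal 3-day limit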
Requesting GPUs
If you would like to request GPU resources, please refer to the following page, which also includes information on how to use the GPUs.
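As a rough sketch only (the GPU count and type are placeholders; the GPU page above covers CHPC-specific partitions, accounts, and available GPU types), GPUs are requested with a generic resource directive such as:

#SBATCH --gres=gpu:1           # request one GPU; a type may also be given, e.g. gpu:<type>:1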
Handy Slurm Information
Slurm User Commands
| Slurm Command | What it does |
|---|---|
| sinfo | reports the state of partitions and nodes managed by Slurm. It has a wide variety of filtering, sorting, and formatting options. For a personalized view, showing only information about the partitions to which you have access, see mysinfo. |
| squeue | reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order. For a personalized view, showing only information about the jobs in the queues/partitions to which you have access, see mysqueue. |
| sbatch | is used to submit a job script for later execution. The script will typically contain one or more #SBATCH directives. |
| scancel | is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step. |
| sacct | is used to report job or job step accounting information about active or completed jobs. |
| srun | is used to submit a job for execution or initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (so much memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation. |
| spart | lists partitions and their utilization. |
| pestat | lists efficiency of cluster utilization on a per-node, per-user, or per-partition basis. By default it prints utilization of all cluster nodes. To select only nodes utilized by a user, run pestat -u $USER. |
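To illustrate how these commands fit together, here is a minimal, generic batch-script sketch and the commands you might use around it. The account, partition, and resource values are placeholders, not CHPC defaults; adjust them to your allocation.

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=<your_account>      # placeholder: see 'mychpc batch'
#SBATCH --partition=<partition>       # placeholder: a partition you have access to
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00               # 1 hour wall time
#SBATCH -o slurm-%j.out               # %j expands to the job ID

echo "Running on $(hostname)"
# ... your simulation, data processing, or compilation commands go here ...

Submit and monitor the job with the user commands described above:

sbatch myjob.slurm        # submit the script for later execution
squeue -u $USER           # check the state of your jobs
sacct -j <jobid>          # accounting information for an active or completed job
scancel <jobid>           # cancel a pending or running job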
Useful Slurm Aliases
A Slurm alias is a shortcut you define in your shell's configuration file (for example, your .aliases file) that executes a long, complex Slurm command with a simple, easy-to-remember name.
How to Use SLURM Aliases:
- Installation: Copy the appropriate lines (Bash or Tcsh) into your .aliases file (or equivalent shell startup file) on a CHPC login node.
- Activation: After adding the aliases, source the file (e.g., source ~/.aliases) or log out and log back in to activate the changes.
- Execution: Instead of typing the full sinfo or squeue commands with all the formatting flags, simply type the alias si, si2, or sq.
Bash to add to .aliases file:
#SLURM Aliases that provide information in a useful manner for our clusters
alias si="sinfo -o \"%20P %5D %14F %8z %10m %10d %11l %32f %N\""
alias si2="sinfo -o \"%20P %5D %6t %8z %10m %10d %11l %32f %N\""
alias sq="squeue -o \"%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R\""
Tcsh to add to .aliases file:
#SLURM Aliases that provide information in a useful manner for our clusters
alias si 'sinfo -o "%20P %5D %14F %8z %10m %10d %11l %32f %N"'
alias si2 'sinfo -o "%20P %5D %6t %8z %10m %10d %11l %32f %N"'
alias sq 'squeue -o "%8i %12j %4t %10u %20q %20a %10g %20P %10Q %5D %11l %11L %R"'
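After sourcing the file, the aliases can be used in place of the full commands, for example:

source ~/.aliases
si        # formatted partition/node summary
sq        # formatted queue listing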
Other CHPC Documentation on Slurm
Looking for more information on running Slurm at the CHPC? Check out these pages. If you have a specific question, please don't hesitate to contact us at helpdesk@chpc.utah.edu.
Slurm Job Preemption and Restarting of Jobs
Slurm Priority Scoring for Jobs
Running Independent Serial Calculations with Slurm
Accessing CHPC's Data Transfer Nodes (DTNs) through Slurm
Other Slurm Constraint Suggestions and Owner Node Utilization
Sharing Nodes Among Jobs with Slurm
Other Good Sources of Information
- http://slurm.schedmd.com/pdfs/summary.pdf A two-page summary of common Slurm commands and options
- http://slurm.schedmd.com/documentation.html Best source for online documentation
- http://slurm.schedmd.com/slurm.html
- http://slurm.schedmd.com/man_index.html
- man <slurm_command> (from the command line)
- http://www.glue.umd.edu/hpcc/help/slurm-vs-moab.html A more complete comparison table between Slurm and Moab
- http://www.schedmd.com/slurmdocs/rosetta.pdf A table of Slurm commands and their counterparts in a number of different batch systems