Running Jobs on GPUs via Slurm
To access GPUs through a Slurm job, there are a few extra parameters that users need to add to their Slurm batch script or to the salloc command. This page details the required parameters and how to adjust them to match the requirements of your job.
If you are interested in learning more about the types of GPU resources the CHPC provides, you can find more information on the GPUs available at the CHPC here and details on which GPUs are available on each cluster here. Not all software or research benefits from GPUs, and therefore not all CHPC users have access to GPUs at the CHPC. If you need access to our GPUs, email us at helpdesk@chpc.utah.edu and explain how your research requires GPUs, at which point we will grant you access.
The upcoming Granite cluster will require allocations for GPU access starting January 1, 2025. No other cluster currently requires an allocation to use GPUs.
Slurm Jobs and GPUs
To request GPU resources within a Slurm job, you need to request both a GPU-specific partition, with its associated account, and the GPU resources themselves via the --gres flag. Below, you will find information on how to pass these parameters to Slurm.
Account and Partition Slurm Flags
The account and partition #SBATCH parameters that you pass to Slurm will change depending on who owns the resources (CHPC versus PIs). The respective sections below detail the correct account/partition pairs. Unsure what you have access to? You can always check your Slurm access with the myallocation command.
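For example, running myallocation from a cluster login node reports the account/partition pairs available to you. The output below is illustrative only; your accounts and partitions will differ:
$ myallocation
You have a general allocation on notchpeak. Account: mygroup, Partition: notchpeak
You have a GPU allocation on notchpeak. Account: notchpeak-gpu, Partition: notchpeak-gpu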
General Nodes
- For General nodes, the corresponding Slurm partition and account settings are:
--account=$(clustername)-gpu
--partition=$(clustername)-gpu
where $(clustername) stands for either notchpeak, kingspeak, lonepeak, or redwood.
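For example, to target the general GPU nodes on notchpeak, a batch script would include the following pair (substitute the cluster of your choice for notchpeak):
#SBATCH --account=notchpeak-gpu
#SBATCH --partition=notchpeak-gpu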
Owner Nodes
- For Owner nodes (non-general nodes), the corresponding Slurm partition and account settings depend on the ownership of the nodes:
- For members of a group that owns GPU nodes, a specific Slurm account & partition pair will be given to access these nodes. For example, members of the group that owns the nodes kp{359-362} in the soc-gpu-kp partition need to use the following settings:
--account=soc-gpu-kp
--partition=soc-gpu-kp
- Users outside the group can also use these devices. The corresponding account & partition names are:
--account=owner-gpu-guest
--partition=$(clustername)-gpu-guest
where $(clustername) stands for either notchpeak or kingspeak. Note that jobs by users outside the group may be subject to preemption.
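For example, a guest job on owner GPU nodes on kingspeak would include:
#SBATCH --account=owner-gpu-guest
#SBATCH --partition=kingspeak-gpu-guest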
GPU Resource Specification
If a user wants to access the GPU devices on a node, they must specify the gres flag (generic consumable resources). The gres flag has the following syntax:
--gres=$(resource_type):[$(resource_name):$(resource_count)]
where:
- $(resource_type) is always equal to gpu for the GPU devices.
- $(resource_name) is a string which describes the type of the requested GPU(s), e.g. 1080ti, titanv, 2080ti, etc.
- $(resource_count) is the number of GPU devices requested of the type $(resource_name). Its value is an integer between 1 and the maximum number of GPU devices on the node.
- The [ ] signals the optional parameters of the --gres flag: to request any single GPU (the default count is 1), regardless of type, --gres=gpu will work. To request more than one GPU of any type, add the $(resource_count), e.g. --gres=gpu:2. However, the flag --gres=gpu:titanx:5 must be used to request 5 GTX Titan X devices, which can only be satisfied by the nodes kp297 and kp298.
Note that if you do not specify the gres flag, your job will run on a GPU node (presuming you use the correct combination of the --partition and --account flags), but it will not have access to the node's GPUs.
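To verify from inside a job which GPUs were granted, note that Slurm typically exports CUDA_VISIBLE_DEVICES for jobs that request the gres flag. A quick check (a sketch, assuming NVIDIA GPUs with nvidia-smi available on the node):
echo $CUDA_VISIBLE_DEVICES   # indices of the GPUs assigned to this job
nvidia-smi -L                # lists the GPU devices visible to the job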
Example Batch Script
An example script that would request two notchpeak nodes with 2x M2090 GPUs each, including all cores and all memory, running one GPU per MPI task, would look like this:
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --mem=0
#SBATCH --partition=notchpeak-gpu
#SBATCH --account=notchpeak-gpu
#SBATCH --gres=gpu:m2090:2
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
To request all 8 3090 GPUs on notch328, again using one GPU per MPI task, we would do:
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=0
#SBATCH --partition=notchpeak-gpu
#SBATCH --account=notchpeak-gpu
#SBATCH --gres=gpu:3090:8
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
As an example, using the script below, you would get four GPUs, four CPU cores, and 8GB of memory. The remaining GPUs, CPUs, and memory will then be accessible for other jobs.
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:titanx:4
#SBATCH --account=kingspeak-gpu
#SBATCH --partition=kingspeak-gpu
The script below will ask for 14 CPU cores, 100 GB of memory and 1 GPU card on one of the P100 nodes.
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=14
#SBATCH --gres=gpu:p100:1
#SBATCH --mem=100GB
#SBATCH --account=owner-gpu-guest
#SBATCH --partition=kingspeak-gpu-guest
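The last two scripts show only the #SBATCH headers; as in the earlier examples, the job's commands would follow, e.g. (myprogram.exe is a placeholder for your own executable):
... prepare scratch directory, etc
myprogram.exe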
Interactive Jobs
To run an interactive job, add the --gres=gpu option to the salloc command, as in the example below. This will allocate the resources to the job, namely two tasks and two GPUs. To run a parallel job, use the srun or mpirun command to launch the calculation on the allocated compute node resources (a usage example follows the salloc command below). To specify more memory than the default 2 GB per task, use the --mem option.
salloc -n 2 -N 1 -t 1:00:00 -p kingspeak-gpu -A kingspeak-gpu --gres=gpu:titanx:2
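Once the allocation is granted, the parallel calculation can be launched from the resulting shell, e.g. (myprogram.exe is a placeholder for your executable):
mpirun -np $SLURM_NTASKS myprogram.exe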
For serial, non-MPI jobs utilizing one or more GPUs, ask for one node, as in the example below. This will allocate the resources to the job, namely one core (task) and one GPU. To run the job, use the srun command to launch the calculation on the allocated compute node resources.
salloc -n 1 -N 1 -A owner-gpu-guest -p kingspeak-gpu-guest --gres=gpu:p100:1
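Again, once the allocation is granted, launch the serial calculation with srun (myprogram.exe is a placeholder):
srun -n 1 myprogram.exe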
Listing free GPUs
To find what GPUs are free on what partition, run the freegpus command. By default this command scans all the clusters and partitions; see freegpus --help for a list of command options. For example, to list free GPUs on the notchpeak-gpu partition, run:
$ freegpus -p notchpeak-gpu
GPUS_FREE: notchpeak-gpu
v100 (x1)
2080ti (x9)
3090 (x2)
a100 (x2)
The output lists each GPU type with the count of free GPUs of that type in parentheses after the letter x.
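You can then use this information in your resource request; for example, one of the two free a100 GPUs shown above could be requested with:
#SBATCH --gres=gpu:a100:1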
Node Sharing
Some programs are serial or able to run only on a single GPU; other jobs perform better on one or a small number of GPUs and therefore cannot efficiently make use of all the GPUs on a node. Therefore, to better utilize our GPU nodes, node sharing has been enabled for the GPU partitions.
Node sharing allows multiple jobs to run on the same node, with each job assigned specific resources (e.g. number of cores, amount of memory, number of accelerators). The node resources are managed by the scheduler up to the maximum available on each node. It should be noted that while efforts are made to isolate jobs running on the same node, there are still many shared components in the system, so a job's performance can be affected by the other job(s) running on the node at the same time. If you are doing benchmarking, request the entire node even if your job will only make use of part of it.
Node sharing is engaged by requesting less than the full number of GPUs, cores, and/or memory of a node, or any combination of the three. By default, each job gets 2 GB of memory per core requested (the lowest common denominator among our cluster nodes); to request a different amount of memory, use the --mem flag. To request exclusive use of the node, use --mem=0.
When node sharing is on (the default unless you ask for the full number of GPUs, cores, or memory), the Slurm scheduler automatically sets task-to-core affinity, mapping one task per physical core. To find which cores are bound to the job's tasks, run:
cat /cgroup/cpuset/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOB_ID/cpuset.cpus
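On nodes where the cgroup hierarchy is mounted under /sys/fs/cgroup instead (the exact mount point depends on the node's configuration, so treat this path as an assumption), the equivalent check would be:
cat /sys/fs/cgroup/cpuset/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOB_ID/cpuset.cpus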
Below is a list of useful job modifiers:
| Option | Explanation |
| --- | --- |
| #SBATCH --gres=gpu:1080ti:1 | request one 1080ti GPU |
| #SBATCH --mem=4G | request 4 GB of RAM |
| #SBATCH --mem=0 | request all memory of the node; this option also requests exclusive use of the node |
| #SBATCH --ntasks=1 | requests 1 task, mapping it to 1 CPU core |
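Putting these modifiers together, a minimal node-sharing batch script might look like the sketch below (the kingspeak account/partition pair, the 1080ti GPU type, and myprogram.exe are placeholders; substitute the cluster, GPU type, and executable appropriate for your work):
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --gres=gpu:1080ti:1
#SBATCH --account=kingspeak-gpu
#SBATCH --partition=kingspeak-gpu
myprogram.exe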