Running Jobs on GPUs via Slurm
To access GPUs through a Slurm job, there are a few extra parameters that users need to add to their Slurm batch script or to the salloc command. This page details the required parameters and how to adjust them to match the requirements of your job.
If you are interested in learning more about the types of GPU resources the CHPC provides, you can find more information on the GPUs available at the CHPC here and details on which GPUs are available on each cluster here. Not all software or research benefits from GPUs, and therefore not all CHPC users have access to GPUs at the CHPC. If you need access to our GPUs, email us at helpdesk@chpc.utah.edu and explain how your research requires GPUs, at which point we will grant you access.
The upcoming Granite cluster will require allocations for GPU access starting January 1, 2025. No other cluster currently requires an allocation to use GPUs.
Slurm Jobs and GPUs
To request GPU resources within a Slurm job, you need to request both a GPU-specific partition, with its associated account, and the GPU resources themselves via the --gres flag. Below, you will find information on how to pass these parameters to Slurm.
Account and Partition Slurm Flags
The account and partition #SBATCH parameters that you pass to Slurm will change depending on who owns the resources (CHPC versus PIs). The respective sections below detail the correct account/partition pairs. Unsure what you have access to? You can always check your Slurm access with the myallocation command.
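For example, running myallocation from a cluster login node reports the account/partition pairs available to you. The output below is illustrative only; your accounts and partitions will differ:
$ myallocation
You have a general allocation on notchpeak. Account: mygroup, Partition: notchpeak
You have a GPU allocation on notchpeak. Account: notchpeak-gpu, Partition: notchpeak-gpu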
General Nodes
- For General nodes, the corresponding Slurm partition and account settings are:
--account=$(clustername)-gpu
--partition=$(clustername)-gpu
where $(clustername) stands for either notchpeak, kingspeak, lonepeak, or redwood.
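For example, to target the general GPU nodes on notchpeak, a batch script would include the following pair (substitute the cluster of your choice for notchpeak):
#SBATCH --account=notchpeak-gpu
#SBATCH --partition=notchpeak-gpu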
Owner Nodes
- For Owner nodes (non-general nodes), the corresponding Slurm partition and account settings depend on the ownership of the nodes:
- For members of a group that owns GPU nodes, a specific Slurm account & partition pair will be given to access these nodes. For example, members of the group that owns the nodes kp{359-362} in the soc-gpu-kp partition need to use the following settings:
--account=soc-gpu-kp
--partition=soc-gpu-kp
- Users outside the group can also use these devices. The corresponding account & partition names are:
--account=owner-gpu-guest
--partition=$(clustername)-gpu-guest
where $(clustername) stands for either notchpeak or kingspeak. Note that jobs by users outside the group may be subject to preemption.
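For example, a guest job on owner GPU nodes on kingspeak would include:
#SBATCH --account=owner-gpu-guest
#SBATCH --partition=kingspeak-gpu-guest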
GPU Resource Specification
If a user wants to access the GPU devices on a node, they must specify the gres flag (generic consumable resources). The gres flag has the following syntax:
--gres=$(resource_type):[$(resource_name):$(resource_count)]
where:
- $(resource_type) is always equal to gpu for the GPU devices.
- $(resource_name) is a string which describes the type of the requested GPU(s), e.g. 1080ti, titanv, 2080ti, etc.
- $(resource_count) is the number of GPU devices requested of the type $(resource_name). Its value is an integer between 1 and the maximum number of GPU devices on the node.
- The [ ] signals the optional parameters of the --gres flag: to request any single GPU (the default count is 1), regardless of type, --gres=gpu will work. To request more than one GPU of any type, add the $(resource_count), e.g. --gres=gpu:2. However, the flag --gres=gpu:titanx:5 must be used to request 5 GTX Titan X devices, which can only be satisfied by the nodes kp297 and kp298.
Note that if you do not specify the gres flag, your job will run on a GPU node (presuming you use the correct combination of the --partition and --account flags), but it will not have access to the node's GPUs.
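To verify from inside a job which GPUs were granted, note that Slurm typically exports CUDA_VISIBLE_DEVICES for jobs that request the gres flag. A quick check (a sketch, assuming NVIDIA GPUs with nvidia-smi available on the node):
echo $CUDA_VISIBLE_DEVICES   # indices of the GPUs assigned to this job
nvidia-smi -L                # lists the GPU devices visible to the job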
Example Batch Script
An example script that would request two notchpeak nodes with 2x M2090 GPUs each, including all cores and all memory, running one GPU per MPI task, would look like this:
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --mem=0
#SBATCH --partition=notchpeak-gpu
#SBATCH --account=notchpeak-gpu
#SBATCH --gres=gpu:m2090:2
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
To request all 8 3090 GPUs on notch328, again using one GPU per MPI task, we would do:
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=0
#SBATCH --partition=notchpeak-gpu
#SBATCH --account=notchpeak-gpu
#SBATCH --gres=gpu:3090:8
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
As an example, using the script below, you would get four GPUs, four CPU cores, and 8GB of memory. The remaining GPUs, CPUs, and memory will then be accessible for other jobs.
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:titanx:4
#SBATCH --account=kingspeak-gpu
#SBATCH --partition=kingspeak-gpu
The script below will ask for 14 CPU cores, 100 GB of memory and 1 GPU card on one of the P100 nodes.
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=14
#SBATCH --gres=gpu:p100:1
#SBATCH --mem=100GB
#SBATCH --account=owner-gpu-guest
#SBATCH --partition=kingspeak-gpu-guest
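The last two scripts show only the #SBATCH headers; as in the earlier examples, the job's commands would follow, e.g. (myprogram.exe is a placeholder for your own executable):
... prepare scratch directory, etc
myprogram.exe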
Interactive Jobs
To run an interactive job, add the --gres=gpu option to the salloc command, as in the example below. This will allocate the resources to the job, namely two tasks and two GPUs. To run a parallel job, use the srun or mpirun command to launch the calculation on the allocated compute node resources (a usage example follows the salloc command below). To specify more memory than the default 2 GB per task, use the --mem option.
salloc -n 2 -N 1 -t 1:00:00 -p kingspeak-gpu -A kingspeak-gpu --gres=gpu:titanx:2
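Once the allocation is granted, the parallel calculation can be launched from the resulting shell, e.g. (myprogram.exe is a placeholder for your executable):
mpirun -np $SLURM_NTASKS myprogram.exe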
For serial, non-MPI jobs utilizing one or more GPUs, ask for one node, as in the example below. This will allocate the resources to the job, namely one core (task) and one GPU. To run the job, use the srun command to launch the calculation on the allocated compute node resources.
salloc -n 1 -N 1 -A owner-gpu-guest -p kingspeak-gpu-guest --gres=gpu:p100:1
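Again, once the allocation is granted, launch the serial calculation with srun (myprogram.exe is a placeholder):
srun -n 1 myprogram.exe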
Listing free GPUs
To find what GPUs are free on what partition, run the freegpus command. By default this command scans all the clusters and partitions; see freegpus --help for a list of command options. For example, to list free GPUs on the notchpeak-gpu partition, run:
$ freegpus -p notchpeak-gpu
GPUS_FREE: notchpeak-gpu
v100 (x1)
2080ti (x9)
3090 (x2)
a100 (x2)
The output lists each GPU type with the count of free GPUs of that type in parentheses after the letter x.
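You can then use this information in your resource request; for example, one of the two free a100 GPUs shown above could be requested with:
#SBATCH --gres=gpu:a100:1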
Node Sharing
Some programs are serial or able to run only on a single GPU; other jobs perform better on one or a small number of GPUs and therefore cannot efficiently make use of all the GPUs on a node. Therefore, to better utilize our GPU nodes, node sharing has been enabled for the GPU partitions.
Node sharing allows multiple jobs to run on the same node, with each job assigned specific resources (e.g. number of cores, amount of memory, number of accelerators). The node resources are managed by the scheduler up to the maximum available on each node. It should be noted that while efforts are made to isolate jobs running on the same node, there are still many shared components in the system, so a job's performance can be affected by the other job(s) running on the node at the same time. If you are doing benchmarking, request the entire node even if your job will only make use of part of it.
Node sharing is engaged by requesting less than the full number of GPUs, cores, and/or memory of a node, or any combination of the three. By default, each job gets 2 GB of memory per core requested (the lowest common denominator among our cluster nodes); to request a different amount of memory, use the --mem flag. To request exclusive use of the node, use --mem=0.
When node sharing is on (the default unless you ask for the full number of GPUs, cores, or memory), the Slurm scheduler automatically sets task-to-core affinity, mapping one task per physical core. To find which cores are bound to the job's tasks, run:
cat /cgroup/cpuset/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOB_ID/cpuset.cpus
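On nodes where the cgroup hierarchy is mounted under /sys/fs/cgroup instead (the exact mount point depends on the node's configuration, so treat this path as an assumption), the equivalent check would be:
cat /sys/fs/cgroup/cpuset/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOB_ID/cpuset.cpus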
Below is a list of useful job modifiers:
| Option | Explanation |
| --- | --- |
| #SBATCH --gres=gpu:1080ti:1 | request one 1080ti GPU |
| #SBATCH --mem=4G | request 4 GB of RAM |
| #SBATCH --mem=0 | request all memory of the node; this option also requests exclusive use of the node |
| #SBATCH --ntasks=1 | requests 1 task, mapping it to 1 CPU core |
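Putting these modifiers together, a minimal node-sharing batch script might look like the sketch below (the kingspeak account/partition pair, the 1080ti GPU type, and myprogram.exe are placeholders; substitute the cluster, GPU type, and executable appropriate for your work):
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --gres=gpu:1080ti:1
#SBATCH --account=kingspeak-gpu
#SBATCH --partition=kingspeak-gpu
myprogram.exe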