GPUs and Accelerators at the CHPC
The CHPC has a limited number of compute nodes with GPUs. The GPU devices are located on the Kingspeak, Notchpeak, Lonepeak, and Redwood (i.e., the Protected Environment) clusters. This document describes the hardware, access, and usage of these resources.
GPU Hardware Overview
The use of the GPU nodes on notchpeak and redwood does not count against any allocation your group may have. However, access to the GPU nodes is restricted to users who require them as part of their research. Please e-mail helpdesk@chpc.utah.edu if you need access. Note that the GPU nodes are a limited resource, and therefore they should only be used for jobs that make use of the GPUs. If CHPC staff notice any jobs on these nodes (outside of the shared-short ones, described below) not using the GPUs, we will kill those jobs.
There are three categories of GPU nodes:
- General nodes are nodes purchased by the CHPC and can be used by anyone with a CHPC account. The partition name for these is $(clustername)-gpu, where $(clustername) stands for either notchpeak, kingspeak, or lonepeak in the general environment, or redwood in the protected environment.
- Owner nodes (non-general nodes) are nodes purchased by a research group and can be identified by their partition name, $(owner)-gpu-$(cluster), where $(owner) is the owner name and $(cluster) refers to the cluster the node is on (np for notchpeak, kp for kingspeak, and rw for redwood). Note that there are no owner GPU nodes on lonepeak.
- notchpeak-shared-short nodes are a special partition of nodes on notchpeak, used mostly for classes and for usage that does not require a long wall time and can be completed with a low core count and minimal memory. These nodes can be used by anyone with a CHPC account, without an additional request (see note at the end of this section). There is also a redwood-shared-short set of nodes on redwood for users inside the protected environment (PE).
To see the GPU specifications on the nodes, you can use si, the CHPC's customized version of the sinfo command, along with the partition name. As an example,
si -p notchpeak-gpu
gives output in which the AVAIL_FEATURES column includes information about the GPU types (v100, 3090, 2080ti, etc.); these features can be used with SLURM constraints, as described on the CHPC slurm page. For additional information on the specifications and performance of each of the different GPUs, the best resource is a web search on the GPU model.
The AVAIL_FEATURES also provide information on the rest of the node, such as the CPU generation (skl, mil, etc.; see the CHPC list of these provided here), the physical core count of the CPU (feature cxx, where xx is the core count), and the amount of CPU memory (feature mxxx, where xxx is the amount of memory in GB).
You also need to know how many GPUs are available per node. You can find this information with the scontrol show node command followed by the node name, e.g. scontrol show node notch293. The Gres entry in the command output gives the GPU type (3090 in this case) and the number of GPUs available (4) on this node.
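For scripting, the type and count can be pulled apart with plain shell string operations. A minimal sketch, assuming the Gres field has the gpu:type:count form described here (the sample value below is illustrative):

```shell
# Sample Gres field as reported for a node (illustrative; actual output may vary)
gres="Gres=gpu:3090:4"
# Strip the "Gres=gpu:" prefix, then split into type and count
type_count=${gres#Gres=gpu:}   # "3090:4"
gpu_type=${type_count%%:*}     # "3090"
gpu_count=${type_count##*:}    # "4"
echo "$gpu_type $gpu_count"
```

On a cluster, the sample value would instead come from the scontrol output for the node of interest.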
GPU Types on CHPC Nodes
The following describes the different types of GPU nodes that were purchased by the CHPC and made available to our researchers. See the Nvidia GPUs wiki page for more details on each GPU type.
- P40 devices (Pascal generation). Each device contains 24 GB of global memory (GDDR5X) with a memory bandwidth of 346 GB/s. It has a single-precision performance of 12 TFlops; the double-precision performance is 0.35 TFlops.
- Titan V devices (Volta generation). Each device has 5120 CUDA cores and a global memory of 12 GB (HBM2). Its memory bandwidth is 652.8 GB/s. Each device has the following performance specifics: 6.9 TFlops (Double Precision), 13.8 TFlops (Single Precision), 27.6 TFlops (Half Precision), and 110 TFlops (Tensor Performance - Deep Learning).
- GTX 1080 Ti devices (Pascal generation). Each device contains 11 GB of global memory (GDDR5X) with a memory bandwidth of 484 GB/s. It also contains 3584 CUDA cores. Each GPU Card has a single-precision performance of 10.6 TFlops. The double-precision performance is only 0.33 TFlops.
- RTX 2080 Ti devices (Turing generation). Each GPU device has 11 GB (GDDR6) global memory with a memory bandwidth of 616 GB/s. It also contains 4352 CUDA cores. The single-precision performance of each GPU device is 11.75 TFlops. Its double-precision performance is 0.37 TFlops.
- GeForce TitanX devices (Maxwell generation). Each of these GPU devices has 12 GB of global memory. Each device has good single-precision performance (~7 TFlops) but does rather poorly with double precision (max ca. 200 GFlops). Therefore, the TitanX nodes should be used for either single-precision or mixed single-double precision GPU codes.
- Tesla P100 devices (Pascal generation). Each device has 16 GB global memory. Each GPU card contains 56 multiprocessors with 64 cores each (3584 CUDA cores in total). The ECC support is (currently) disabled. The system interface is a PCIe Gen3 bus. The double-precision performance per card is 4.7 TFlops. The single-precision performance is 9.3 TFlops and the half-precision performance is 18.7 TFlops.
- Tesla V100 (Volta generation). Each GPU has 16 GB of global memory. Peak double-precision performance of a Tesla V100 is 7 TFlops; its peak single-precision performance is 14 TFlops. Its memory bandwidth is 900 GB/s. It also contains 5120 CUDA cores.
- Tesla A100 (Ampere generation). Each GPU has either 40 or 80 GB of global memory and a peak double-precision performance of 9.7 TFlops. The total memory bandwidth is 1555 GB/s. There are 6912 INT32/FP32 CUDA cores.
- RTX 3090 (Ampere generation). Each GPU has 24 GB global memory with 10,496 INT32/FP32 CUDA cores and 328 Tensor cores. The peak single-precision performance is 35.7 TFlops. The memory bandwidth is 936 GB/s.
- Tesla T4 (Turing generation). Each GPU has 16 GB global memory, with 2,560 INT32/FP32 CUDA cores and 320 Tensor cores. The peak single-precision performance is 8.1 TFlops, with 65 TFlops peak mixed-precision performance. The memory bandwidth is 320 GB/s.
- Tesla A40 (Ampere generation): Each GPU has 48 GB global memory. The memory bandwidth is 696 GB/s. There are 10,752 INT32/FP32 CUDA cores, 84 2nd generation Ray Tracing cores, and 336 3rd generation Tensor cores.
- Tesla A30 (Ampere generation): Each GPU has 24 GB global memory and a peak double-precision performance of 5.2 TFlops. The memory bandwidth is 933 GB/s.
- RTX A6000 (Ampere generation): Each GPU has 48 GB memory, 10,752 INT32/FP32 CUDA cores, memory bandwidth 768 GB/s.
- H100 (Hopper generation): Each GPU has either 80 or 96 GB memory, 16,896 INT32/FP32 CUDA cores, peak double-precision performance of 34 or 30 TFlops, and memory bandwidth of 3.35 or 3.9 TB/s.
*Note that this list does not include the GPUs hosted by the CHPC but owned by particular research groups. You can find the GPUs available in the owner nodes through the command si -p $(clustername)-gpu-guest, where $(clustername) is replaced by notchpeak, kingspeak, or redwood.
Access to GPUs and Running Jobs with GPUs
General and Owner Nodes
Below is information on requesting the GPU nodes in your SLURM batch script.
- For General nodes, the corresponding SLURM partition and account settings are:
--account=$(clustername)-gpu
--partition=$(clustername)-gpu
where $(clustername) stands for either notchpeak, kingspeak, lonepeak, or redwood.
- For Owner nodes (non-general nodes), the corresponding SLURM partition and account settings depend on the ownership of the nodes:
- For members of the groups who own GPU nodes, specific SLURM account and partition settings are given to access these nodes. For example, members of the group that owns nodes kp{359-362} in the soc-gpu-kp partition need to use the following settings:
--account=soc-gpu-kp
--partition=soc-gpu-kp
- Users outside the group can also use these devices. The corresponding account & partition names are:
--account=owner-gpu-guest
--partition=$(clustername)-gpu-guest
where $(clustername) stands for either notchpeak or kingspeak. Note that jobs by users outside the group may be subject to preemption.
GPU Resource Specification
To access the GPU devices on a node, you must specify the gres flag (a.k.a. generic consumable resources). The gres flag has the following syntax:
--gres=$(resource_type):[$(resource_name):$(resource_count)]
where:
- $(resource_type) is always equal to gpu for the GPU devices.
- $(resource_name) is a string describing the type of the requested GPU(s), e.g. 1080ti, titanv, 2080ti, etc.
- $(resource_count) is the number of GPU devices requested of the type $(resource_name). Its value is an integer between 1 and the maximum number of GPU devices on the node.
- The brackets [ ] signal optional parameters in the --gres flag: to request any single GPU (the default count is 1), regardless of type, --gres=gpu will work. To request more than one GPU of any type, add the $(resource_count), e.g. --gres=gpu:2. However, the flag --gres=gpu:titanx:5 must be used to request 5 GTX Titan X devices, a request that can only be satisfied by the nodes kp297 and kp298.
Note that if you do not specify the gres flag, your job will run on a GPU node (presuming you use the correct combination of the --partition and --account flag), but it will not have access to the node's GPUs.
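To make the syntax concrete, here is a minimal sketch that assembles a --gres flag from its three components; the variable names mirror the placeholders above and the values are illustrative:

```shell
# Components of the gres specification (illustrative values)
resource_type="gpu"
resource_name="titanx"
resource_count=5
# Assemble the full flag, e.g. for an sbatch command line or #SBATCH directive
gres_flag="--gres=${resource_type}:${resource_name}:${resource_count}"
echo "$gres_flag"
```

The resulting string, --gres=gpu:titanx:5, matches the example above requesting five GTX Titan X devices.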
Listing free GPUs
To find which GPUs are free in which partition, run the freegpus command. By default this command scans over all the clusters and partitions; see freegpus --help for a list of command options. For example, to list free GPUs in the notchpeak-gpu partition, run:
$ freegpus -p notchpeak-gpu
GPUS_FREE: notchpeak-gpu
v100 (x1)
2080ti (x9)
3090 (x2)
a100 (x2)
The output lists the GPU type with the count of the free GPUs of that type in the parentheses after the letter x.
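As a sketch, output of this form can be totaled with plain shell tools; the sample lines below copy the example output above:

```shell
# Sample freegpus output lines for one partition (copied from the example above)
output="v100 (x1)
2080ti (x9)
3090 (x2)
a100 (x2)"
# Sum the counts inside the (xN) fields
total=0
while read -r _type count; do
  count=${count#(x}   # strip "(x" prefix
  count=${count%)}    # strip ")" suffix
  total=$(( total + count ))
done <<< "$output"
echo "$total"
```

For the sample above this prints 14, the total number of free GPUs in the partition.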
Node Sharing
Some programs are serial or can use only a single GPU; other jobs perform best on one or a few GPUs and therefore cannot efficiently make use of all of the GPUs on a single node. Therefore, in order to better utilize our GPU nodes, node sharing has been enabled for the GPU partitions.
Node sharing allows multiple jobs to run on the same node, each job being assigned specific resources (e.g. number of cores, amount of memory, number of accelerators). The node resources are managed by the scheduler up to the maximum number of available resources on each node. It should be noted that while efforts are made to isolate jobs running on the same node, there are still many shared components in the system, so a job's performance can be affected by other jobs running on the node at the same time. If you are doing benchmarking, you will want to request the entire node even if your job will only make use of part of it.
Node sharing can be accessed by requesting less than the full number of GPUs, cores, and/or memory on a node. By default, each job gets 2 GB of memory per core requested (the lowest common denominator among our cluster nodes); to request a different amount of memory, use the --mem flag. To request exclusive use of the node, use --mem=0.
When node sharing is on (the default unless the job requests the full number of GPUs, cores, or memory), the SLURM scheduler automatically sets task-to-core affinity, mapping one task per physical core. To find which cores are bound to the job's tasks, run:
cat /cgroup/cpuset/slurm/uid_$SLURM_JOB_UID/job_$SLURM_JOB_ID/cpuset.cpus
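The file holds a compact core list such as 0-3,8-11. As a minimal sketch (the sample value is illustrative), such a list can be expanded to a core count in the shell:

```shell
# Sample cpuset.cpus contents (illustrative)
cpus="0-3,8-11"
count=0
# Split on commas, then expand each bare index or lo-hi range
IFS=','
for r in $cpus; do
  case $r in
    *-*) lo=${r%-*}; hi=${r#*-}; count=$(( count + hi - lo + 1 )) ;;
    *)   count=$(( count + 1 )) ;;
  esac
done
unset IFS
echo "$count"
```

For the sample value this prints 8, the number of cores bound to the job.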
Below is a list of useful job modifiers for use:
Option | Explanation |
#SBATCH --gres=gpu:1080ti:1 | request one 1080ti GPU |
#SBATCH --mem=4G | request 4 GB of RAM |
#SBATCH --mem=0 | request all memory of the node; this option also gives the job exclusive use of the node |
#SBATCH --ntasks=1 | request 1 task, mapping it to 1 CPU core |
Example Batch Script
An example script that would request two notchpeak nodes with 2xM2090 GPUs, including all cores and all memory, running one GPU per MPI task, would look like this:
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --mem=0
#SBATCH --partition=notchpeak-gpu
#SBATCH --account=notchpeak-gpu
#SBATCH --gres=gpu:m2090:2
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
To request all 8 3090 GPUs on notch328, again using one GPU per MPI task, we would use:
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --mem=0
#SBATCH --partition=notchpeak-gpu
#SBATCH --account=notchpeak-gpu
#SBATCH --gres=gpu:3090:8
#SBATCH --time=1:00:00
... prepare scratch directory, etc
mpirun -np $SLURM_NTASKS myprogram.exe
As an example, using the script below, you would get four GPUs, four CPU cores, and 8GB of memory. The remaining GPUs, CPUs, and memory will then be accessible for other jobs.
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:titanx:4
#SBATCH --account=kingspeak-gpu
#SBATCH --partition=kingspeak-gpu
The script below will ask for 14 CPU cores, 100 GB of memory and 1 GPU card on one of the P100 nodes.
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=14
#SBATCH --gres=gpu:p100:1
#SBATCH --mem=100GB
#SBATCH --account=owner-gpu-guest
#SBATCH --partition=kingspeak-gpu-guest
Interactive Jobs
To run an interactive job, add the --gres=gpu option to the salloc command, as in the example below. This allocates the resources to the job, namely two tasks and two GPUs. To run a parallel job, use the srun or mpirun commands to launch the calculation on the allocated compute node resources. To specify more memory than the default 2 GB per task, use the --mem option.
salloc -n 2 -N 1 -t 1:00:00 -p kingspeak-gpu -A kingspeak-gpu --gres=gpu:titanx:2
For serial, non-MPI jobs utilizing one or more GPUs, ask for one node, as in the example below. This allocates the resources to the job, namely one core (task) and one GPU. To run the job, use the srun command to launch the calculation on the allocated compute node resources.
salloc -n 1 -N 1 -A owner-gpu-guest -p kingspeak-gpu-guest --gres=gpu:p100:1
GPU Programming Environment and Performance
As of early 2021, we offer Nvidia programming tools, including CUDA, PGI CUDA Fortran, and the OpenACC compilers. The latest programming tools are all included in the Nvidia HPC SDK, available via module load nvhpc, which includes both the former Nvidia CUDA compilers and the PGI compilers, along with the Nvidia GPU libraries, debugger, and profilers. Alternatively, one can explicitly load a CUDA version with module load cuda/version; run module spider cuda to see what versions are available. The deprecated PGI compilers, which come with their own CUDA, can be set up by loading the PGI module, module load pgi.
To compile CUDA code so that it runs on all of the GPU types that we have, use the following compiler flags: -gencode arch=compute_20,code=sm_20 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70. For more information on the CUDA compilation and linking flags, please have a look at http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.
The Nvidia HPC compilers (formerly PGI compilers) specify the GPU architecture with the -tp=tesla flag. If no further option is specified, the flag generates code for all available compute capabilities (at the time of writing cc20, cc30, cc35, cc50 and cc60). To target a specific GPU, -tp=tesla:cc20 can be used for the M2090, -tp=tesla:cc50 for the TitanX, and -tp=tesla:cc60 for the P100. To invoke OpenACC, use the -acc flag. More information on OpenACC can be obtained at http://www.openacc.org.
Good tutorials on GPU programming are available at the CUDA Education and Training site from Nvidia.
When running GPU code, it is worth checking the resources that the program is using, to ensure that the GPU is well utilized. For that, one can run the nvidia-smi command and watch the memory and GPU utilization. nvidia-smi is also useful to query and set various features of the GPU; see nvidia-smi --help for all the options that the command takes. For example, nvidia-smi -L lists the GPU card properties. On the TitanX nodes:
Titan Card: GPU 0: GeForce GTX TITAN X (UUID: GPU-cd731d6a-ee18-f902-17ff-1477cc59fc15)
Note that you can also check on the utilization of the GPUs assigned to a batch job from an interactive node with the following command and the job number:
srun --pty --jobid XXXX nvidia-smi
This will return the nvidia-smi results for the GPU(s) assigned to your job.
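As a small sketch, the model name can be extracted from an nvidia-smi -L line with shell string operations; the parsing assumes the GPU N: name (UUID: ...) layout and uses the sample line shown above:

```shell
# One line of `nvidia-smi -L` output (sample; the "GPU N: name (UUID: ...)" layout is assumed)
line='GPU 0: GeForce GTX TITAN X (UUID: GPU-cd731d6a-ee18-f902-17ff-1477cc59fc15)'
# Drop the "GPU N: " prefix and the "(UUID: ...)" suffix
name=${line#GPU *: }
name=${name% (UUID:*}
echo "$name"
```

This prints the bare model name, GeForce GTX TITAN X, which is handy when logging which device a job landed on.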
Libraries
The Nvidia HPC SDK bundles math libraries, which can be used to offload computation to the GPUs, and communication libraries, which can be used for fast communication between multiple GPUs. The math libraries are located at $NVROOT/math_libs and the communication libraries at $NVROOT/comm_libs. To compile and link, e.g. with cuBLAS, include -I$NVROOT/math_libs/include in the compilation line, and -L$NVROOT/math_libs/lib64 -Wl,-rpath=$NVROOT/math_libs/lib64 -lcublas in the link line.
Debugging
The Nvidia HPC SDK and CUDA distributions include a terminal debugger named cuda-gdb. Its operation is similar to the GNU gdb debugger. For details, see the cuda-gdb documentation. For out-of-bounds and misaligned memory access errors, there is the cuda-memcheck tool. For details, see the cuda-memcheck documentation.
The Totalview debugger that we used to license and the DDT debugger that we currently license also support CUDA and OpenACC debugging. Due to their user-friendly graphical interfaces, we recommend them for GPU debugging. For information on how to use DDT or Totalview, see our debugging page.
Profiling
Profiling can be very useful in finding GPU code performance problems, for example inefficient GPU utilization, use of shared memory, etc. Nvidia CUDA provides a visual profiler called Nsight Systems (nsight-sys). There are also an older, deprecated command line profiler (nvprof) and visual profiler (nvvp). More information is in the CUDA profilers documentation.
NOTE that the usage of GPU hardware counters is restricted due to a security problem in the Nvidia driver, as detailed in this Nvidia post. This restriction manifests as the following error:
$ nvprof -m all ./my-gpu-program
==4983== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link
for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
In this case, please open a support ticket asking for a reservation on a GPU node. Our admins will enable hardware counter profiling for all users inside of this reservation and provide instructions on its use.
Installed GPU codes
We have the following GPU codes installed:
Code name | Module name | Prerequisite modules | Sample batch script(s) location | Other notes |
HOOMD | hoomd | gcc/4.8.5 mpich2/3.2.g | /uufs/chpc.utah.edu/sys/installdir/hoomd/2.0.0g-[sp,dp]/examples/ | |
VASP | vasp | intel impi cuda/7.5 | /uufs/chpc.utah.edu/sys/installdir/vasp/examples | Per group license, let us know if you need access |
AMBER | amber-cuda | gcc/4.4.7 mvapich2/2.1.g | adapt CPU script | |
LAMMPS | lammps/10Aug15 | intel/2016.0.109 impi/5.1.1.109 | adapt CPU script | |
If there is any other GPU code that you would like us to install, please let us know.
Some commercial programs that we have installed, such as Matlab, also have GPU support. Either try them or contact us.