Intel oneAPI

In December 2020, Intel released its oneAPI software development suite, which is free of charge and replaces the Intel Parallel Studio XE Cluster Edition that CHPC previously licensed. In our environment, we install the Base and HPC toolkits, which include the following libraries and applications:

Programming languages:

Libraries (need to be loaded as a separate module):

  • Intel oneAPI Math Kernel Library (MKL) - high performance math library, module load mkl. Official page
  • Intel oneAPI Collective Communications Library - communication library for machine learning frameworks, module load ccl. Official page
  • Intel oneAPI Data Analytics Library - accelerated library for data analytics and machine learning, module load dal. Official page
  • Intel oneAPI Deep Neural Network Library - accelerated deep learning library, module load dnnl. Official page
  • Intel oneAPI DPC++ Library - external DPC++ library, module load dpl. Official page
  • Intel MPI Library - a high performance MPI based on MPICH2, module load impi. Official page
  • Intel Integrated Performance Primitives - optimized software building blocks, module load ipp. Official page
  • Intel Integrated Performance Primitives for Cryptography - optimized software building blocks, module load ippcp. Official page
  • Intel oneAPI Threading Building Blocks - C++ library for creating high performance, scalable parallel applications, module load tbb. Official page
  • Intel oneAPI Video Processing Library - accelerated video processing library, module load vpl. Official page

Programming tools:

All of the Intel tools have excellent documentation, and we recommend that you follow the tutorials to learn how to use the tools and consult the documentation for details. Most of the tools work well out of the box; however, a few have peculiarities in our local installation, which we try to make known in this document.

Intel Compilers

The Intel compilers are described on our compiler page. They provide a wealth of code optimization options and generally produce the fastest code on Intel CPU platforms. Please note that the oneAPI compilers don't include the libraries by default, so if you use a library that used to be part of the Intel compilers (e.g. MKL or TBB), you now need to load the appropriate module, as described below.

Intel Math Kernel Library

MKL is a high performance math library containing full BLAS, LAPACK, ScaLAPACK, transforms and more. To use it, module load mkl. For details on MKL use, see our math libraries page.

Intel Advisor

Intel Advisor is a thread and vectorization prototyping tool. The vectorization analysis can be very helpful in detecting loops that could take advantage of vectorization with simple code changes, potentially doubling to quadrupling the performance on CPUs with wider vector units (CHPC's Kingspeak and Notchpeak clusters).

Start with loading the module - module load advisor. Then launch the tool GUI with advixe-gui.

Then either use your own serial code or get examples from the Advisor web page.

The thread prototyping process generally involves four steps:

  1. Create a project and survey (time) the target code
  2. Annotate the target code to tell Advisor what sections to parallelize
  3. Analyze the annotated code to predict parallel performance and predict parallel problems
  4. Add the parallel framework (OpenMP, TBB, Cilk+) to the code based on feedback from step 3

The vectorization profiling involves the following:

  1. Target survey to explore where to add vectorization or threading
  2. Find trip counts to see how many iterations each loop executes
  3. Check data dependencies in the loop and use the Advisor hints to fix them
  4. Check memory accesses to identify and fix complex data access patterns
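The four vectorization steps above can also be run without the GUI, which is convenient inside a batch job. The following is a minimal sketch, assuming the advixe-cl command line tool is available after module load advisor and that its survey, tripcounts, dependencies and map analysis types correspond to the four steps; the function name, the project directory and the executable name are placeholders chosen for illustration. By default the function only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of the four vectorization analysis steps using the Advisor CLI.
# Assumptions: advixe-cl is on PATH after "module load advisor";
# the project directory and executable names are placeholders.

# Print each command unless DRY_RUN=0 is set, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

advisor_vectorization_steps() {
    project_dir=$1   # directory where Advisor stores its results
    shift            # remaining arguments: the program and its options
    # survey -> trip counts -> data dependencies -> memory access patterns
    for analysis in survey tripcounts dependencies map; do
        run advixe-cl --collect="$analysis" --project-dir="$project_dir" -- "$@"
    done
}
```

For example, advisor_vectorization_steps ./advi_project ./myApp prints the four advixe-cl commands in order. The collected project can then be opened in advixe-gui.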

Intel Inspector

Intel Inspector is a memory and thread error debugging tool.

Start with loading the module - module load inspector.

Then either use your own serial code or get examples from the Inspector web page.

Inspector has two major workflows, one geared at memory and the other at thread inspection and debugging. The debugging process involves compiling the code with the -g flag, creating and populating a project, selecting the type of debugging (memory or thread), running the code inside the Inspector tool and, when done, going over the report that Inspector provides. Be aware that Inspector may report false positives, which can be turned off for future analysis using suppressions. See the user's guide and tutorials for details on how to use the tool.
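The two workflows can also be started from the command line. The sketch below assumes the inspxe-cl command line tool is available after module load inspector, and that its mi2 and ti2 analysis types (memory problems, and deadlocks and data races, respectively) are suitable starting points; the function name and result directory names are illustrative only. By default the function only prints the command; set DRY_RUN=0 to execute it.

```shell
#!/bin/sh
# Sketch: choose an Inspector analysis type for a memory or thread check.
# Assumptions: inspxe-cl is on PATH after "module load inspector";
# the result directory names are placeholders.

# Print each command unless DRY_RUN=0 is set, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

inspector_check() {
    kind=$1   # "memory" or "threads"
    shift     # remaining arguments: the program (built with -g) and its options
    case $kind in
        memory)  analysis=mi2 ;;   # detect memory problems
        threads) analysis=ti2 ;;   # detect deadlocks and data races
        *) echo "usage: inspector_check memory|threads program [args]" >&2
           return 1 ;;
    esac
    run inspxe-cl -collect "$analysis" -result-dir "./inspector_$kind" -- "$@"
}
```

For example, inspector_check memory ./a.out prints the command for a memory analysis of ./a.out.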

Intel VTune Profiler

VTune is an advanced performance profiler. Its main appeal is integrated code performance measurement and evaluation and support for multithreading on both CPUs and accelerators (GPUs, Intel Xeon Phis).

For CPU-based profiling on CHPC Linux systems, we highly recommend using a whole compute node for your profiling session. When profiling on an interactive node or in a shared SLURM partition, other users' processes on that node are likely to affect your program's performance and produce unreliable profiling results.

To profile an application in VTune GUI, do the following:

1. Source the VTune environment: module load vtune
2. Start the VTune GUI: vtune-gui
3. Follow the GUI instructions to start a new sampling experiment, run it and then visualize the results.

If you want to use the CPU hardware counters to count CPU events, you also need to load the VTune Linux kernel modules before profiling and unload them when done. This requires membership in the sudoers group that is allowed to do so; send a message to our help desk to be added.

1. Source the VTune environment: module load vtune
2. Load the kernel module: sudo $VTUNE_DIR/sepdk/src/insmod-sep -r -g vtune
3. Start the VTune GUI: vtune-gui
4. Follow the GUI instructions to start a new sampling experiment, run it and then visualize the results.
5. Unload the kernel module: sudo $VTUNE_DIR/sepdk/src/rmmod-sep

To profile a distributed parallel application (e.g. MPI), one has to use the command line interface for VTune. This can be done either in a SLURM job script or in an interactive job. Inside the script or interactive job, do the following:

1. Source the VTune environment: module load vtune
2. Source the appropriate compiler and MPI, e.g. module load intel impi
3. Run the VTune command line command, e.g. mpirun -np $SLURM_NTASKS vtune -collect hotspots -result-dir /path/to/directory/with/VTune/result myExecutable. Note that we are explicitly stating where to put the results, as the VTune default results directory name can cause problems during the job launch.
4. To analyze the results, we recommend using the VTune GUI, i.e. on the cluster interactive node, start the VTune GUI and then use the "Open Result" option in the main GUI window to find the directory with the result obtained above.

Finding the right command line parameters for the needed analysis can be cumbersome, which is why VTune provides a button that displays the command line. This button looks like >_ and is located at the bottom center of the VTune GUI window.

If you need hardware counters, you'll need to start the kernel module on all the job's nodes before running the mpirun, as:
srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node=1 sudo $VTUNE_DIR/sepdk/src/insmod-sep -r -g vtune

And similarly unload the kernel module from all job's nodes at the end of the job.
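Putting the pieces together, the body of a job script for an MPI profiling run with hardware counters might look like the sketch below. It only combines the commands quoted above; the function name and the result directory argument are illustrative, and the hotspots analysis type is carried over from the earlier example and should be replaced with the analysis you need. By default the function only prints the commands; set DRY_RUN=0 inside a real job to execute them.

```shell
#!/bin/sh
# Sketch of an MPI profiling run with the VTune kernel module loaded
# on every node of the job. Assumes it runs inside a SLURM job, so that
# SLURM_NTASKS and SLURM_JOB_NUM_NODES are set, and that "module load vtune"
# defines VTUNE_DIR.

# Print each command unless DRY_RUN=0 is set, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

vtune_mpi_job() {
    result_dir=$1   # explicit VTune result directory
    shift           # remaining arguments: the executable and its options
    run module load vtune
    run module load intel impi
    # load the kernel module once per node
    run srun -n "$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
        sudo "$VTUNE_DIR/sepdk/src/insmod-sep" -r -g vtune
    run mpirun -np "$SLURM_NTASKS" vtune -collect hotspots \
        -result-dir "$result_dir" "$@"
    # unload the kernel module once per node, even if the run above failed
    run srun -n "$SLURM_JOB_NUM_NODES" --ntasks-per-node=1 \
        sudo "$VTUNE_DIR/sepdk/src/rmmod-sep"
}
```

Because the function does not stop on errors, the unload step still runs when the profiled program exits with a non-zero status.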

Intel MPI

Intel MPI is a high performance MPI library which runs on many different network interfaces. The main reason for having IMPI, though, is its seamless integration with ITAC and its features. It is generally slightly slower than the top choice MPIs that we use on the clusters; however, there may be applications in which IMPI outperforms our other MPIs, so we recommend including IMPI in performance testing before deciding which MPI to use for production runs. For a quick introduction to Intel MPI, see the Getting Started guide.

Intel MPI by default works with whatever interface it finds on the machine at runtime. To use it, module load impi.

For best performance we recommend using Intel compilers along with the IMPI, so, to build, use the Intel compiler wrapper calls mpiicc, mpiicpc, mpiifort.

For example

mpiicc code.c -o executable

Since IMPI is designed to run on multiple network interfaces, one just needs to build a single executable which should be able to run on all CHPC clusters. Combining this with the Intel compiler's automatic CPU dispatch flag (-axCORE-AVX2,AVX,SSE4.2) makes it possible to build a single executable for all the clusters. The network interface selection is controlled with the I_MPI_FABRICS environment variable. The default should be the fastest network, in our case InfiniBand. We can verify the network selection by running the Intel MPI benchmark and looking at the time it takes to send a message from one node to another:

srun -n 2 -N 2 -A mygroup -p ember --pty /bin/tcsh -l

mpirun -np 2 /uufs/

# Benchmarking PingPong
# #processes = 2
#bytes #repetitions t[usec] Mbytes/sec
0 1000 1.74 0.00

It takes 1.74 microseconds to send a message there and back, which is typical for an InfiniBand network.

Intel MPI provides two different MPI fabrics for InfiniBand, one based on Open Fabrics Enterprise Distribution (OFED), and the other on Direct Access Programming Library (DAPL), denoted by ofa and dapl, respectively. Moreover, one can also specify the intra-node communication, for which the fastest should be shared memory (shm). According to our observations, the default fabric is shm:dapl, which can be confirmed by setting the environment variable I_MPI_DEBUG to 2 or larger, e.g.
mpirun -genv I_MPI_DEBUG 2 -np 2 /uufs/
[0] MPI startup(): shm and dapl data transfer modes

The performance of OFED and DAPL is comparable, but it may be worthwhile to test both to see if your particular application gets a boost from one fabric or the other.
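A simple way to test both fabrics is to run the same benchmark or application once per fabric setting. The sketch below assumes the shm:dapl and shm:ofa values of I_MPI_FABRICS built from the fabric names above; the function name is illustrative, and the program to run is passed as an argument because its path depends on the installation. By default the function only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: run the same two-process test once per InfiniBand fabric.
# Assumption: shm:dapl and shm:ofa are accepted I_MPI_FABRICS values
# for the installed Intel MPI version.

# Print each command unless DRY_RUN=0 is set, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

compare_fabrics() {
    # all arguments: the benchmark or application and its options
    for fabric in shm:dapl shm:ofa; do
        run mpirun -genv I_MPI_FABRICS "$fabric" -np 2 "$@"
    done
}
```

Comparing the reported timings of the two runs shows which fabric suits the application better.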

If we'd like to use the Ethernet network instead (except for Lonepeak, not recommended for production due to slower communication speed), we choose I_MPI_FABRICS tcp and get:

mpirun -genv I_MPI_FABRICS tcp -np 2 /uufs/

# Benchmarking PingPong
# #processes = 2
#bytes #repetitions t[usec] Mbytes/sec
0 1000 18.56 0.00

Notice that the latency on the Ethernet is about 10x larger than on the InfiniBand.
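To make this comparison repeatable, the zero-byte latency can be extracted from saved benchmark output and the ratio computed with awk. This is a minimal sketch that assumes the four-column output layout shown above (#bytes, #repetitions, t[usec], Mbytes/sec); the function names and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Extract the zero-byte PingPong latency (in microseconds) from a file
# holding saved benchmark output. Assumes the four-column layout:
#   #bytes #repetitions t[usec] Mbytes/sec
zero_byte_latency() {
    awk '$1 == "0" && NF == 4 { print $3; exit }' "$1"
}

# Ratio of the slower to the faster latency, rounded to one decimal place.
latency_ratio() {
    awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}
```

With the InfiniBand output saved in ib.txt and the Ethernet output in eth.txt, latency_ratio "$(zero_byte_latency ib.txt)" "$(zero_byte_latency eth.txt)" prints 10.7 for the values quoted above (18.56 / 1.74).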

As of Intel MPI 5.0 and MPICH 3.1 (and MVAPICH2 1.9 and higher, which is based on MPICH 3.1), the libraries are interchangeable at the binary level, using a common Application Binary Interface (ABI). In practice this means that one can build the application with MPICH but run it using the Intel MPI libraries, thus taking advantage of the Intel MPI functionality. See details about this at
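As an illustration of the ABI compatibility, the sketch below builds a dynamically linked program with MPICH and then runs it with the Intel MPI runtime. The module name mpich, the function name and the file names are assumptions made for this example and may differ on a given cluster. By default the function only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: build with MPICH, run with Intel MPI through the common ABI.
# Assumptions: an MPICH 3.1 or newer module named "mpich" exists, and the
# executable is dynamically linked so the MPI library is resolved at runtime.

# Print each command unless DRY_RUN=0 is set, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

abi_swap_demo() {
    run module load mpich            # hypothetical module name
    run mpicc code.c -o executable   # build against MPICH
    run module unload mpich
    run module load impi             # switch the runtime to Intel MPI
    run mpirun -np 4 ./executable    # same binary, Intel MPI libraries
}
```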

Intel Trace Analyzer and Collector

ITAC can be used for MPI code checking and for profiling. To use it, module load itac.

MPI/OpenMP profiling

It is best to run ITAC with the Intel compiler and Intel MPI, since that way one can take advantage of their interoperability. That is, also load the compiler and MPI modules:

module load intel impi itac

On existing code that was built with IMPI, just run

mpirun -trace -n 4 ./a.out

This will produce a set of trace files a.out.stf*, which are then loaded to the Trace Analyzer as

traceanalyzer a.out.stf &

To take advantage of additional profiling features, compile with the -trace flag as

mpiicc -trace code.c

The ITAC reference guide is a good resource for more detailed information about other ways to invoke the tracing, instrumentation, etc. A good tutorial on how to use ITAC to profile an MPI code is named Detecting and Removing Unnecessary Serialization.

MPI correctness check

To run the correctness checker, the easiest way is to compile with the -check_mpi flag as

mpiicc -check_mpi code.c

and then plainly run as

mpirun -check -n 4 ./a.out

If the executable was built with another MPI of the MPICH2 family, one can specifically invoke the checker library by

mpirun -genv LD_PRELOAD libVTmc.so -genv VT_CHECK_TRACING on -n 4 ./a.out

At least this is what the manual says, but it appears to only create the trace file. So, the safest way is to use -check_mpi during compilation. You can tell that MPI checking is enabled because the program writes a lot of output describing what it is doing at runtime, such as:


Once the program is done, if there is no MPI error, it'll say:

[0] INFO: Error checking completed without finding any problems.

We recommend that anyone developing an MPI program run it through the MPI checker before putting the program to serious use. It can help to uncover hidden problems that could be hard to locate during normal runtime.
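When the checker is part of a regular test run, the success message quoted above can be looked for automatically. The following is a minimal sketch; the function name and the log file name in the usage note are hypothetical, and it assumes the checker output has been captured to a file.

```shell
#!/bin/sh
# Return success (exit status 0) if the captured checker output contains
# the message that is printed when no MPI errors were found.
mpi_check_passed() {
    # -F matches the text literally, -q suppresses the matching line
    grep -qF "INFO: Error checking completed without finding any problems." "$1"
}
```

For example, after capturing the output with mpirun -check -n 4 ./a.out > check.log 2>&1 (redirecting both output streams, since we are not certain which one the checker writes to), the test mpi_check_passed check.log || echo "MPI checker reported problems" flags runs that need attention.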

Intel's tutorial on this topic is here.

Last Updated: 6/10/21