Intel oneAPI

In December 2020, Intel released its oneAPI software development suite, which is free of charge and replaces the previous or Intel Parallel Studio XE Cluster Edition that CHPC licensed. In our environment, we install the Base and HPC toolkits, which include the following libraries and applications:

Programming languages:

Intel oneAPI C++ and Data Parallel C++ compiler - C/C++ compiler - official page
Intel Fortran Compiler - Fortran compiler - official page
Intel Distribution for Python - accelerated Python interpreter and libraries, also available from Anaconda - official page

Libraries, Need to be loaded as a separate module, please contact CHPC if they are not available, or you need newer version, as we don't install these unless there is demand. Compiler needs to be loaded beforehand:

Intel oneAPI Math Kernel Library (MKL) - high performance math library, module load intel-oneapi-mkl. Official page
Intel oneAPI Collective Communications Library - communication library for machine learning frameworks, module load intel-oneapi-ccl. Official page
Intel oneAPI Data Analytics Library - accelerated library for data analytics and machine learning, module load intel-oneapi-dal. Official page
Intel oneAPI Deep Neural Network Library - accelerated deep learning library, module load intel-oneapi-dnn. Official page
Intel oneAPI DPC++ Library - external DPC++ library, module load intel-oneapi-dpl. Official page
Intel MPI Library - a high performance MPI based on MPICH2, module load intel-oneapi-mpi. Official page
Intel Integrated Performance Primitives - optimized software building blocks, module load intel-oneapi-ipp. Official page
Intel Integrated Performance Primitives for Cryptography- optimized software building blocks, module load intel-oneapi-ippcp. Official page
Intel oneAPI Thread Building Blocks - C and C++ library for creating high performance, scalable parallel applications, module load intel-oneapi-tbb. Official page
Intel oneAPI Video Processing Library - accelerated video processing library, module load intel-oneapi-vpl. Official page

Programming tools:

Intel Advisor - thread and vectorization prototyping tool - Official page
Intel Inspector - memory and thread error debugger - Official page
Intel VTune Profiler - performance profiler - Official page
Intel Trace Analyzer and Collector (ITAC) - MPI profiler and runtime correctness checker - Official page

All of the Intel tools have excellent documentation and we recommend you follow the tutorials to learn how to use the tools and the documentation to find details. Most of the tools work well out of the box, however, few of the tools have peculiarities regarding our local installation which we try to make known in this document.

Intel Compilers

The Intel compilers are described on our compiler page. They provide a wealth of code optimization options and generally produce the fastest code on the Intel CPU platforms. Please, note that the oneAPI compilers don't include the libraries by default, so, if you use a library that you used in the past as a part of the Intel compilers (e.g. MKL or TBB), you now need to load the appropriate module, as described below.

Intel Math Kernel Library

MKL is a high performance math library containing full BLAS, LAPACK, ScaLapack, transforms and more. It is available as module load intel-oneapi-mkl. For details on MKL use see our math libraries page.

Intel Inspector

Intel Inspector is a thread and memory debugging tool. It helps in finding errors due to multi-threading (thread safety), and due to memory access (segmentation faults), which are difficult to analyze with traditional source code debuggers.

Start with loading the module - module load intel-oneapi-inspector. Then launch the tool GUI with inspxe-gui.

Then either use your own serial code or get examples from the Inspector tutorials.

Intel Advisor

Intel Advisor is a thread and vectorization prototyping tool. The vectorization can be very helpful in detecting loops that could take advantage of vectorization with simple code changes, potentially doubling to quadrupling the performance on CPUs with higher vectorization capabilities (CHPC's Kingspeak and Notchpeak clusters).

Start with loading the module - module load intel-oneapi-advisor. Then launch the tool GUI with advixe-gui.

Then either use your own serial code or get examples from the Advisor web page.

The thread prototyping process generally involves four steps:

Create a project and survey (time) the target code
Annotate the target code to tell Advisor what sections to parallelize
Analyze the annotated code to predict parallel performance and predict parallel problems
Add the parallel framework (OpenMP, TBB, Cilk+) to the code based on feedback from step 3

The vectorization profiling involves the following:

Target survey to explore where to add vectorization or threading
Find trip counts to see how many iterations each loop executes
Check data dependencies in the loop and use the Advisor hints to fix them
Check memory accesses to identify and fix complex data access patterns

Intel VTune Profiler

VTune is an advanced performance profiler. Its main appeal is integrated code performance measurement and evaluation and support for multithreading on both CPUs and accelerators (GPUs, Intel Xeon Phis).

For CPU based profiling on CHPC Linux systems. We highly recommend using a whole compute node for your profiling session, profiling on either an interactive node or using shared SLURM partition will have a high likelihood of other user processes on that node to affect your program's performance and produce an unreliable profiling result.

To profile an application in VTune GUI, do the following:

1. Source the VTune environment: module load intel-oneapi-vtune
2. Start the VTune GUI: vtune-gui
3. Follow the GUI instructions to start a new sampling experiment, run it and then visualize the results.

If you want to use the CPU hardware counters to count CPU events, you also need to load the VTune Linux kernel modules before the profiling, and unload them after done. To be able to do that, you need to be added to the sudoers group that can do this by sending a message to our help desk.

1. Source the VTune environment: module load intel-oneapi-vtune
2. Load the kernel module: sudo $VTUNE_DIR/sepdk/src/insmod-sep -r -g vtune
3. Start the VTune GUI: vtune-gui
4. Follow the GUI instructions to start a new sampling experiment, run it and then visualize the results.
5. Unload the kernel module: sudo $VTUNE_DIR/sepdk/src/rmmod-sep

To profile a distributed parallel application (e.g. MPI), one has to use the command line interface for VTune. This can be done either in the SLURM job script, or ran as an interactive job. Inside of the script, or interactive job, do the following:

1. Source the VTune environment: module load intel-oneapi-vtune2. Source the appropriate compiler and MPI, e.g. module load intel intel-oneapi-mpi
3. Run the VTune command line command, e.g. mpirun -np $SLURM_NTASKS vtune -collect hotspots -result-dir /path/to/directory/with/VTune/result myExecutable.Note that we are explicitly stating where to put the results, as the VTune default results directory name can cause problems during the job launch.
4. To analyze the results, we recommend to use the VTune GUI, i.e. on the cluster interactive node, start, and then use the "Open Result" option in the main GUI window to find the directory with the result obtained above.

Finding the right command line parameters for the needed analysis can be cumbersome, which is why VTune provides a button that displays the command line. This button looks like >_ and is located at the bottom center of the VTune GUI window.

If you need hardware counters, you'll need to start the kernel module on all the job's nodes before running the mpirun, as:srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node=1 sudo $VTUNE_DIR/sepdk/src/insmod-sep -r -g vtune

And similarly unload the kernel module from all job's nodes at the end of the job.

Intel MPI

Intel MPI is a high performance MPI library which runs on many different network interfaces. The main reason for having IMPI, though is its seamless integration with ITAC and its features. It's generally slightly slower than the top choice MPIs that we use on the clusters, though, there may be applications in which IMPI outperforms our other MPIs so we recommend to include IMPI in performance testing before deciding what MPI to use for production runs. For a quick introduction to Intel MPI, see the Getting Started guide.

Intel MPI by default works with whatever interface it finds on the machine at runtime. To use it module load intel-oneapi-mpi . For details, see the MPI Library help page.

Intel Trace Analyzer and Collector

ITAC can be used for MPI code checking and for profiling. To use it, module load intel-oneapi-itac.

MPI/OpenMPI Profiling

It is the best to run ITAC with Intel compiler and MPI, since that way one can take advantage of their interoperability. That is, also

module load intel-oneapi-compilers intel-oneapi-mpi intel-oneapi-itac

On existing code that was built with IMPI, just run

mpirun -trace -n 4 ./a.out

This will produce a set of trace files a.out.stf*, which are then loaded to the Trace Analyzer as

traceanalyzer a.out.stf &

To take advantage of additional profiling features, compile with -trace command as

mpiicc -trace code.c

ITAC reference guide is a good resource for more detailed info about other ways to invoke the tracing, instrumentation, etc. A good tutorial on how to use ITAC to profile a MPI code is named Detecting and Removing Unnecessary Serialization

MPI Correctness Check

To run the correctness checker, the easiest is to compile with -check_mpi flag as

mpiicc -check_mpi code.c and then plainly run as

mpirun -check -n 4 ./a.out

If the executable was built with other MPI of the MPICH2 family, one can specifically invoke the checker library by

mpirun -genv LD_PRELOAD libVTmc.so -genv VT_CHECK_TRACING on -n 4 ./a.out

At least this is what the manual says, but, it seems like it's just creating the trace file. So, the safest way is to use -check_mpi during compilation. The way to tell the MPI checking is enabled is that the program will start writing out a lot of output describing what it's doing during the runtime, such as:

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON

Once the program is done, if there is no MPI error, it'll say:

[0] INFO: Error checking completed without finding any problems.

We recommend anyone who is developing MPI program to run their program through the MPI checker before starting some serious use of the program. It can help to uncover hidden problems that could be hard to locate during normal runtime.

Intel's tutorial on this topic is here.