Session 14: Shared Memory Parallel Programming with OpenMP

General information

As seen previously, we may use ManeFrame II for parallel computing (in addition to running serial jobs). In the remainder of this tutorial, we will differentiate between running shared-memory parallel (SMP) programs, typically enabled by OpenMP, and distributed-memory parallel (DMP) programs, typically enabled by MPI. Hybrid MPI+OpenMP programming is also possible on ManeFrame II, and we will end this tutorial session with a brief study of that approach.

We have also seen that we may choose among the GNU, PGI, and Intel compilers when compiling our codes on ManeFrame II. In the tutorial below, we will focus on the GNU compilers; use of the PGI and Intel compilers is similar.

Getting started

Since we’ll be using the GNU compilers throughout this tutorial, load the gcc-7.3 module:

$ module load gcc-7.3
$ module list

Second, you will need to retrieve sets of files for the OpenMP, MPI, and hybrid MPI+OpenMP portions of this session. Retrieve the files for the OpenMP portion by copying them on ManeFrame II at the command line:

$ cp /hpc/examples/workshops/hpc/session14.tgz .

Shared-memory programs

We may run shared-memory programs on any ManeFrame II worker node. All ManeFrame II worker nodes have 8 CPU cores. In my experience, shared-memory programs rarely benefit from using more execution threads than the number of physical cores on a node, so I recommend that SMP jobs use at most 8 threads, though your application may behave differently.

Enabling OpenMP

OpenMP is implemented as an extension to existing programming languages and is available for programs written in C, C++, Fortran 77, and Fortran 90. These extensions are enabled at the compiler level: most modern compilers support OpenMP, and you activate it by supplying a flag that tells the compiler to honor the OpenMP directives in your source code. The flags for some well-known compilers are listed below (a small source-code example follows the list):

  • PGI: -mp
  • GNU: -fopenmp
  • Intel: -openmp
  • IBM: -qsmp
  • Oracle: -xopenmp
  • Absoft: -openmp
  • Cray: (on by default)
  • NAG: -openmp
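To make the "language extension" idea concrete, here is a minimal sketch of an OpenMP-enabled C++ program. This file is not part of the session materials; the file name and contents are purely illustrative. Each thread in the team prints its own ID:

// hello_omp.cpp -- illustrative only; not one of the session14 files
#include <cstdio>
#include <omp.h>

int main() {
  // every thread in the team executes this block once
  #pragma omp parallel
  {
    std::printf("Hello from thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}

Compile and run it with, e.g., g++ -fopenmp hello_omp.cpp -o hello_omp.exe followed by ./hello_omp.exe; one line is printed per thread.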

Compiling with OpenMP

Before proceeding to the following subsections, unpack the OpenMP portion of this tutorial using the usual commands:

$ tar -zxf session14.tgz
$ cd session14

In the resulting directory, you will find a number of files, including Makefile, driver.cpp and vectors.cpp.

You can compile the executable driver.exe with the GNU compiler and OpenMP using the command

$ g++ -fopenmp driver.cpp vectors.cpp -lm -o driver.exe

The compiler option -fopenmp is the same no matter which GNU compiler you are using (gcc, gfortran, etc.).
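For example, a Fortran source file (the file name below is just a placeholder) would be compiled with:

$ gfortran -fopenmp program.f90 -o program.exe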

Note

The only difference when using the PGI compilers is the compiler name and OpenMP flag, e.g.

$ pgc++ -mp driver.cpp vectors.cpp -lm -o driver.exe

Running with OpenMP

Running OpenMP programs at the command line

Run the executable driver.exe from the command line:

$ ./driver.exe

Depending on your default setup, this run will have used either 1 or 8 threads (when OMP_NUM_THREADS is unset, most OpenMP runtimes default to one thread per available CPU core).

To control the number of threads used by our program, we must adjust the OMP_NUM_THREADS environment variable. First, check your current default value (it may be blank):

$ echo $OMP_NUM_THREADS

The method for setting this environment variable depends on your login shell. First, determine which login shell you use:

$ echo $SHELL

For CSH/TCSH users, you can set your OMP_NUM_THREADS environment variable to 2 with the command:

$ setenv OMP_NUM_THREADS 2

The same may be accomplished by BASH/SH/KSH users with the command:

$ export OMP_NUM_THREADS=2
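If you would like to confirm how many threads the OpenMP runtime will actually use, a small check program along the following lines can help. This file is illustrative only and is not one of the session files:

// check_threads.cpp -- illustrative only; not one of the session14 files
#include <cstdio>
#include <omp.h>

int main() {
  // omp_get_max_threads() reports the team size that the next parallel
  // region would use; it honors the OMP_NUM_THREADS environment variable
  std::printf("OpenMP will use up to %d threads\n", omp_get_max_threads());
  return 0;
}

Compile it with g++ -fopenmp check_threads.cpp -o check_threads.exe and re-run it after changing OMP_NUM_THREADS to see the effect.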

Re-run driver.exe, first using 1 and then using 3 OpenMP threads. Notice the speedup when running with multiple threads. Also notice that although the reported result, Final rms norm, is essentially the same in both runs, the values differ slightly after around the 11th digit. The full explanation is beyond the scope of this tutorial, but in short this arises from a combination of floating-point roundoff error and differences in the order of arithmetic operations. The punch line is that bit-for-bit identical results between runs are difficult to achieve in parallel computations, and in any case may not be necessary in the first place.

Running OpenMP batch jobs

To run an OpenMP-enabled batch job, the steps are identical to those required for requesting an exclusive node, except that we must additionally specify the environment variable OMP_NUM_THREADS. It is recommended that this variable be set inside the batch job submission file to ensure reproducibility of results.

Create a batch job submission file:

#!/bin/bash
#SBATCH -J test1          # job name
#SBATCH -o test1.txt      # output/error file name
#SBATCH -p workshop       # requested queue
#SBATCH --exclusive       # do not share the compute node
#SBATCH -t 1              # maximum runtime in minutes

# set the desired number of OpenMP threads
export OMP_NUM_THREADS=7

# run the code
./driver.exe
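
Assuming you save the script above under a name such as test1_job.sh (the name is arbitrary), submit and monitor it in the usual way:

$ sbatch test1_job.sh
$ squeue -u $USER

When the job finishes, its output appears in test1.txt, the file named by the -o option above.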

Recall that the --exclusive option indicates that we wish to run the job on an entire node (without sharing that node with others). This is critical for SMP jobs: each SMP job launches multiple threads of execution, and we do not want to intrude on other users by running threads on their CPU cores!

Furthermore, note that once the job is launched, it will use 7 of the node's 8 CPU cores, leaving one core idle.

Note

In fact, each worker node does much more than just run your job (it runs the operating system, handles network traffic, etc.), so in many instances SMP jobs run faster when using \(N-1\) threads than when using \(N\) threads, where \(N\) is the number of CPU cores, since this leaves one core free to handle all remaining non-job duties.

OpenMP exercise

Compile the program driver.exe using the GNU compiler with OpenMP enabled.

Create a single SLURM submission script that will run the program driver.exe using 1, 2, 3, …, 12 OpenMP threads on ManeFrame II’s parallel partition. Recall from session 5 that you may embed multiple commands within your job submission script.
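
One possible way to embed the repeated runs, sketched here only in outline (the #SBATCH header lines would follow the same style as the earlier example), is a loop that resets OMP_NUM_THREADS before each run:

# fragment of a submission script -- the SBATCH header is omitted here
for n in 1 2 3 4 5 6 7 8 9 10 11 12; do
    export OMP_NUM_THREADS=$n
    echo "Running with $OMP_NUM_THREADS threads"
    ./driver.exe
done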

Launch this job, and when it has completed, determine the parallel efficiency (i.e., strong-scaling performance) of this code (defined in session 6, parallel_computing_metrics). How well does the program perform? Is there a number of threads beyond which additional resources no longer improve the speed?
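
As a reminder (session 6 has the full discussion), a commonly used definition of parallel efficiency for a strong-scaling study is

\[ E(p) = \frac{T_1}{p\,T_p}, \]

where \(T_1\) is the runtime using one thread and \(T_p\) is the runtime using \(p\) threads; values of \(E(p)\) near 1 indicate nearly ideal scaling.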

Note

If you finish this early, perform the same experiment but this time using the PGI compiler. How do your results differ?