This Rough Guide is a Crib Sheet for improving the efficiency of parallel software for future supercomputers, and was last updated 3rd August 2022.
This guide first appeared as an appendix of a CompBioMed deliverable [1], and was then updated to appear as an appendix of the EXCELLERAT CoE deliverable [2].
The term Exascale describes HPC hardware capable of at least one exaFLOPS, i.e. 10^18 FLoating point OPerations per Second. It is envisioned that such machines will have many multi-core processors, and that the available memory per core will be far smaller than on current HPC platforms. This was already seen when porting MPI codes to the IBM Blue Gene machines, or to the Intel Xeon Phi family, where the amount of memory per core is prohibitively small for many codes parallelised using MPI only. As such, the common practice of running one MPI task per physical core may no longer be possible for the majority of codes in the future.
For many technical and economic reasons, many HPC systems are deploying GPUs (currently from NVIDIA or AMD) to achieve greater performance. Indeed, Frontier was recently announced as the first official exascale machine, and it uses AMD GPUs to accelerate its calculations.
The solution for getting codes ready for exascale platforms requires both software and hardware related strategies. The former, the subject of this note, is described below. The latter, beyond the scope of this note, is achieved via Co-Design, where hardware vendors and end users work together to ensure future platforms are not built to achieve exascale performance at the expense of usability.
Application codes rarely perform and scale well when first parallelised: each doubling of scale typically exposes a new issue. Ensuring the application will scale on HPC systems - both today and on the exascale systems of the future - requires stepwise increases of scale, with validation of correctness, debugging, performance analysis and tuning at each step, repeated for each significant code extension or optimisation. Through performance analysis, programmers can locate so-called “hot spots”, i.e. the code which takes the most time, as this code should then be targeted for improvement.
Given the memory per core will most likely be substantially reduced compared to today’s HPC platforms, the practice of assigning one MPI task per physical core will have to be replaced by assigning an MPI task to, say, every 2nd or 4th core. This is known as under-populating nodes. At first glance, this appears to suggest that we cannot fully exploit the hardware, as we simply avoid using 50% or even 75% of the cores; however, these spare cores can be employed via mixed-mode codes, where each MPI task runs threaded routines/loops on the remaining cores.
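As an illustration, under-population plus threading might be requested as follows in a batch script, assuming a Slurm system; the node size, the 1-in-4 ratio, and the binary name my_app are all illustrative, not prescriptive:

```shell
#!/bin/bash
# Hypothetical job script for nodes with 128 cores per node:
# place one MPI rank on every 4th core (32 ranks per node) and give
# each rank 4 OpenMP threads to use the cores it leaves free.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_app   # my_app stands in for your mixed-mode binary
```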
Essentially, authors must expose as many levels of parallelism as possible within their code. These levels range from ensemble runs and coupled multiscale codes, through multiprocessing (with inter-process communication) and multithreading, down to vector processing and accelerator-specific commands. This process can slow the code down on present-day platforms but will future-proof the code.
For instance, there are sets of serial algorithms, so-called Optimal Serial Algorithms, which are often difficult or simply impossible to parallelise, as they employ data from previous steps, or even the current step, to make improvements at the current step. Such dependencies can prevent concurrent execution of threads in the program.
The inelegant yet empowering solution is to replace the optimal serial algorithm with a sub-optimal serial algorithm which is, however, parallelisable. Whilst the serial performance may be worse, the parallel performance will soon outperform the serial version as the number of cores increases.
Improve serial code
Before considering how the code is parallelised, the first step is to consider the serial sections of the code.
- Remove excess memory use in serial code.
- When using C++, find a good balance between Object Oriented Programming (OOP) and functional programming, as intensive use of OOP might introduce an unnecessary layer of complexity into scientific code.
- Ensure proper use of standard libraries
Introduce vector processing
- Use appropriate compiler options
- Write ordered loops or leave this to compilers?
- Innermost loop must have independent iterations
- Loop length is either larger than, or a multiple of, the vector length
- it is possible to fix this at compile time, but not to probe and populate it at runtime
- No function calls, except maths libraries
- functions can be vectorised using OpenMP “declare SIMD” feature
- No complex control flow
- Determinable trip count (i.e. no while)
- the trip count must be known at runtime before entering the loop
- Data access should be vector aligned, i.e. start at vector boundaries, and preferably contiguous
- For more advice on vector processing 
- Be aware of the ISA (Instruction Set Architecture), such as SSE (Streaming SIMD Extensions), AVX (Advanced Vector Extensions), etc.
- it determines the vector length
- may target vectorised FMA instructions
- Do loop padding manually to get rid of peel/remainder loops
- Concerning vectorisation, check the compiler's vectorisation reports or the assembly code to see what was actually vectorised
- Use inline hints for functions or routines to help the compiler inline them
- Remember that YOU know your application better than the compiler does.
- It all depends on how the data is aligned in RAM
- Hints with pragmas might be useful, also
- Force data alignment with compiler instructions (usually done automatically by the compiler)
Improve MPI code
- MPI messages should be grouped to avoid multiple smaller messages
- e.g. use derived data types to avoid double buffering
- Avoid any storage or computation of O(nranks)
- Avoid all-to-all communication
- e.g. if(rank==0)then do work over all other ranks
- Remove unnecessary MPI_BARRIERs
- Do not over schedule cores when using threaded maths libraries
- typically controlled using OMP_NUM_THREADS, even when the libraries do not use OpenMP
- Use nonblocking collective communications.
- overlap computation and communications where possible
- Remove unnecessary communication synchronisation
- use MPI_TEST rather than MPI_WAIT
- avoid MPI_Probe
- it most likely forces internal buffering to report the size of the pending message
- Avoid ordered halo swapping
- e.g. do not delay y-direction sends until x-direction receives have completed.
- however, huge network bursts are also not ideal
- sometimes, ordered sends allow ordered receives.
- and ordered sends might allow one to take advantage of the network topology
- e.g. one can completely load the network with x-direction halo swaps and so on
- Ensure load is balanced
- Avoid the receive-before-send scenario
- one-sided communications can alleviate this
- Be aware that not all MPI libraries are equal
- e.g., there are many ways to implement collective communications.
- Respect the fact that the MPI standard prohibits concurrent read accesses on the same buffer (even though there is no race condition)
- It may reduce the efficiency (or cause bugs)
- Tag the source
- Be aware: “blocking” has an alternative meaning in the MPI standard.
- This can easily lead to serialisation of huge chunks of the program.
- Interleave/overlap communication with computation where possible
Improve MPI parallelism
- Give each MPI task multiple sub-domains
- a subdomain is a distinct region of the computational domain and a result of the domain decomposition algorithm.
- this allows lightweight parallelism on a socket, keeps cache logically together, etc.
- Enable multiple tiles per task.
- this might make tiles fit into cache, but time will be spent swapping boundary information with itself
- Use MPI Communicators, to map the communication to the target HPC topology
- Collective operations are possible on a subset of processes.
- Explicit communicators are very useful to leverage MPI Shared Memory
- One-sided communication (or Remote Memory Access (RMA)), can be faster than the message passing model
- May be beneficial when the load is hard to balance, since delays in the receiving process do not necessarily propagate to the sender
Introduce OpenMP for threads on cores and OpenACC for GPUs
A code which uses both MPI and OpenMP, or both MPI and OpenACC, is referred to as a mixed-mode code. This is done for two reasons: to reduce the memory footprint and/or to speed up the application.
Not all MPI codes benefit from becoming mixed-mode codes. The benefits are as follows.
- Hybrid applications have a reduced memory footprint (the shared memory model allows threads to avoid halo regions or ghost cells)
- Eases the load balance issue (usually the complexity of (adaptive) load balancing grows with the number of subdomains)
- Load balancing in threads is much easier
- e.g. via a thread-pool model
- For applications which are MPI-bound due to load imbalance (long waits in MPI_Wait or MPI_Recv/MPI_Send), it might be advisable to reduce the number of processes and increase the number of threads, while using OpenMP’s built-in load balancing features
Whilst the drawbacks are as follows:
- In case of MPI_THREAD_MULTIPLE, the application might lose portability
- in this mode, any forked thread is allowed to call MPI routines, which not all MPI libraries support efficiently
- Shared memory applications have their own problems
- e.g. false sharing, where a cache line is invalidated repeatedly
- this is naturally avoided by MPI processes
- NUMA effects, e.g. where the data is placed in memory
- this can be resolved by careful task mapping.
It may also be better to use OpenMP 4.5 target directives rather than OpenACC.
Improve OpenMP parallelism
An excellent Best Practice Guide for writing mixed-mode programs, i.e. MPI+OpenMP, can be found via the INTERTWinE project pages
- Investigate OpenMP tasks
- Try different schedules and/or tasks
- Avoid over-scheduling threads when calling threaded maths libraries.
- Minimise sequential code
- Replicating computation rarely works
- Ensure load balance over threads.
- Use different loop schedules or tasks
- May include balancing communication in one thread with calculations in the rest
- Avoid MPI derived data types here, as the packing they imply is done on one thread: it is better to pack data in parallel using threads, and MPI should not need to pack at all when the data is contiguous
- Take care with NUMA effects by considering mapping, i.e. task placement
- e.g. run at least one MPI process per NUMA node
- Take care with process and thread binding: threads should run on the same socket as their parent MPI process.
- Minimise the number of OpenMP barriers
- Use OpenMP directives to force SIMD operations
OpenMP allows explicit vectorisation of functions called from vectorised loops
General programming tips
- Be aware of the Load-Hit-Store problem (it does exist on multiple levels)
- prevents caching by the compiler and causes pipeline stalls.
- e.g., appears in MPI-IO (sometimes referred to as Read-Modify-Write effect)
- IO can dominate
- consider MPI-IO or, better still, HDF5
Whilst this section includes good practice for software engineers in general, the following points are key when preparing for exascale systems, primarily because large popular codes outlive the programmers who, in turn, typically outlive the HPC platforms for which the software was written.
- follow a strict coding style guide
- use readable variables
- use internal documentation
- keep routines to less than one page
- copyright statements for every module/subroutine/function
- key for IP monitoring
[1] “D2.2 Report on Deployment of Deep Track Tools and Services to Improve Efficiency of Research and Facilitating Access to CoE Capabilities”, CompBioMed, December 2019, https://www.compbiomed.eu/wp-content/uploads/2019/02/D2.2-Report-on-Deployment-of-Deep-Track-Tools-and-Services-to-Improve-Efficiency-of-Research-and-Facilitating-Access-to-CoE-Capabilities_v1.1.pdf
[2] “D4.7 Final Report on Enhanced Services Progress”, EXCELLERAT, June 2022, https://www.excellerat.eu/wp-content/uploads/2022/06/EXCELLERAT_WP4_D4.7_Final_Report_on_Enhanced_Services_Progress_v1.0.pdf