This Rough Guide is a Crib Sheet for improving the efficiency of parallel software for future supercomputers, and was last updated 3rd August 2022.
This guide first appeared as an appendix of a CompBioMed deliverable [1], and was then updated to appear as an appendix of the EXCELLERAT CoE deliverable [2].
The term Exascale describes HPC hardware capable of at least one exaFLOPS, i.e. 10^18 FLoating point OPerations per Second. It is envisioned that such machines will have many multi-core processors, and that the available memory per core will be far smaller than on current HPC platforms. This was already seen when porting MPI codes to the IBM Blue Gene machines, or to the Intel Xeon Phi family, where the amount of memory per core is prohibitively small for many codes parallelised using MPI only. As such, the common practice of running one MPI task per physical core may no longer be possible for the majority of codes in the future.
For many technical and economic reasons, many HPC systems are deploying GPUs (currently from NVIDIA or AMD) to achieve greater performance. Indeed, Frontier was recently announced as the first official exascale machine, and it uses AMD GPUs to accelerate its calculations.
The solution for getting codes ready for exascale platforms requires both software and hardware related strategies. The former, the subject of this note, is described below. The latter, beyond the scope of this note, is achieved via Co-Design, where hardware vendors and end users work together to ensure future platforms are not built to achieve exascale performance at the expense of usability.
Application codes rarely perform and scale well when first parallelised: each doubling of scale typically exposes a new issue. Ensuring the application will scale on HPC systems - both today and on the exascale systems of the future - requires stepwise increases of scale, with validation of correctness, debugging, performance analysis and tuning at each step, repeated for each significant code extension or optimisation. Through performance analysis, programmers can locate so-called “hot spots”, i.e. the code which takes the most time, as this code should then be targeted for improvement.
Given the memory per core will most likely be substantially reduced compared to today’s HPC platforms, the practice of assigning one MPI task per physical core will have to be replaced by assigning an MPI task to, say, every 2nd or 4th core. This is known as under-populating nodes. At first glance, this appears to suggest that we cannot fully exploit the hardware, as we simply avoid using 50% or even 75% of the cores; however, these spare cores can be employed via mixed-mode codes, where each MPI task runs threaded routines/loops on the remaining cores.
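As an illustration, under-population plus threading might be requested as follows in a batch script, assuming a Slurm system; the node size, the 1-in-4 ratio, and the binary name my_app are all illustrative, not prescriptive:

```shell
#!/bin/bash
# Hypothetical job script for nodes with 128 cores per node:
# place one MPI rank on every 4th core (32 ranks per node) and give
# each rank 4 OpenMP threads to use the cores it leaves free.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./my_app   # my_app stands in for your mixed-mode binary
```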
Essentially, authors must expose as many levels of parallelism as possible within their code. These levels range from ensemble runs and coupled multiscale codes, through multiprocessing (with inter-process communication) and multithreading, down to vector processing and accelerator-specific commands. This process can slow the code down on present-day platforms but will future-proof the code.
For instance, there are sets of serial algorithms, so-called Optimal Serial Algorithms, which are often difficult or simply impossible to parallelise, as they employ data from previous steps, or even the current step, to make improvements at the current step. Such dependencies can prevent concurrent execution of threads in the program.
The inelegant yet empowering solution is to replace the optimal serial algorithm with a sub-optimal serial algorithm which is, however, parallelisable. Whilst the serial performance may be worse, the parallel performance will soon outperform the serial version as the number of cores increases.
Improve serial code
Before considering how the code is parallelised, the first step is to consider the serial sections of the code.
- Remove excess memory use in serial code.
- When using C++, find a good balance between Object Oriented Programming (OOP) and functional programming, as intensive use of OOP might introduce an unnecessary layer of complexity into scientific code.
- Ensure proper use of standard libraries
Introduce vector processing
- Use appropriate compiler options
- Write ordered loops or leave this to compilers?
- Innermost loop must have independent iterations
- Loop length is either larger than, or a multiple of, the vector length
- it is possible to fix this at compile time, but not to probe and populate it at runtime
- No function calls, except maths libraries
- functions can be vectorised using OpenMP “declare SIMD” feature
- No complex control flow
- Determinable trip count (i.e. no while)
- the trip count must be known at runtime before entering the loop
- Data access should be vector aligned, i.e. start at vector boundaries, and preferably contiguous
- For more advice on vector processing 
- Be aware of the ISA (Instruction Set Architecture), such as SSE (Streaming SIMD Extensions), AVX (Advanced Vector Extensions), etc.
- it determines the vector length
- may target vectorised FMA instructions
- Do loop padding manually to get rid of peel/remainder loops
- Concerning vectorisation, check the compiler's vectorisation reports or the assembly code to see what was actually vectorised
- Use inline hints for functions or routines to help the compiler inline them
- Remember that YOU know your application better than the compiler does.
- It all depends on how the data is aligned in RAM
- Hints with pragmas might be useful, also
- Force data alignment with compiler instructions (usually done automatically by the compiler)
Improve MPI code
- MPI messages should be grouped to avoid multiple smaller messages
- e.g. use derived data types to avoid double buffering
- Avoid any storage or computation of O(nranks)
- Avoid all-to-all communication
- e.g. if(rank==0)then do work over all other ranks
- Remove unnecessary MPI_BARRIERs
- Do not over schedule cores when using threaded maths libraries
- typically controlled using OMP_NUM_THREADS, even when the libraries do not use OpenMP
- Use nonblocking collective communications.
- overlap computation and communications where possible
- Remove unnecessary communication synchronisation
- use MPI_TEST rather than MPI_WAIT
- avoid MPI_Probe
- it most likely forces internal buffering to report the size of the pending message
- Avoid ordered halo swapping
- e.g. do not delay y-direction sends until x-direction receives have completed.
- however, huge network bursts are also not ideal
- sometimes, ordered sends allow ordered receives.
- and ordered sends might allow one to take advantage of the network topology
- e.g. one can completely load the network with x-direction halo swaps and so on
- Ensure load is balanced
- Avoid the receive-before-send scenario
- one-sided communications can alleviate this
- Be aware that not all MPI libraries are equal
- e.g., there are many ways to implement collective communications.
- Respect the fact that the MPI standard prohibits concurrent read accesses on the same buffer (even though there is no race condition)
- It may reduce the efficiency (or cause bugs)
- Tag the source
- Be aware: “blocking” has an alternative meaning in the MPI standard.
- This can easily lead to serialisation of huge chunks of the program.
- Interleave/overlap communication with computation where possible
Improve MPI parallelism
- Give each MPI task multiple sub-domains
- a subdomain is a distinct region of the computational domain and a result of the domain decomposition algorithm.
- this allows lightweight parallelism on a socket, keeps cache logically together, etc.
- Enable multiple tiles per task.
- this might make tiles fit into cache, but time will be spent swapping boundary information with itself
- Use MPI Communicators, to map the communication to the target HPC topology
- Collective operations are possible on a subset of processes.
- Explicit communicators are very useful to leverage MPI Shared Memory
- One-sided communication (or Remote Memory Access (RMA)), can be faster than the message passing model
- May be beneficial when the load is hard to balance, since delays in the receiving process do not necessarily propagate to the sender
Introduce OpenMP for threads on cores and OpenACC for GPUs
A code which uses both MPI and OpenMP, or both MPI and OpenACC, is referred to as a mixed-mode code. This is done for two reasons: to reduce the memory footprint and/or to speed up the application.
Not all MPI codes benefit from becoming mixed-mode codes. The benefits are as follows.
- Hybrid applications have a reduced memory footprint (the shared memory model allows threads to avoid halo regions or ghost cells)
- Eases the load balance issue (usually the complexity of (adaptive) load balancing grows with the number of subdomains)
- Load balancing in threads is much easier
- e.g. via a thread-pool model
- For applications which are MPI-bound due to load imbalance (long waits in MPI_Wait or MPI_Recv/MPI_Send), it might be advisable to reduce the number of processes and increase the number of threads, while using OpenMP’s built-in load balancing features
Whilst the drawbacks are as follows:
- In case of MPI_THREAD_MULTIPLE, the application might lose portability
- in this mode, any forked thread is allowed to call MPI routines, which not all MPI libraries support efficiently
- Shared memory applications have their own problems
- e.g. false sharing, where a cache line is invalidated repeatedly
- this is naturally avoided by MPI processes
- NUMA effects, e.g. where the data is placed in memory
- this can be resolved by careful task mapping.
It may also be better to use OpenMP 4.5 target directives rather than OpenACC.
Improve OpenMP parallelism
An excellent Best Practice Guide for writing mixed-mode programs, i.e. MPI+OpenMP, can be found via the INTERTWinE project pages
- Investigate OpenMP tasks
- Try different schedules and/or tasks
- Avoid over-scheduling threads when calling threaded maths libraries.
- Minimise sequential code
- Replicating computation rarely works
- Ensure load balance over threads.
- Use different loop schedules or tasks
- May include balancing communication in one thread with calculations in the rest
- Avoid MPI derived data types here, as the packing they imply is done on one thread: it is better to pack data in parallel using threads, and MPI should not need to pack at all when the data is contiguous
- Take care with NUMA effects by considering mapping, i.e. task placement
- e.g. run at least one MPI process per NUMA node
- Take care with process and thread binding: threads should run on the same socket as their parent MPI process.
- Minimise the number of OpenMP barriers
- Use OpenMP directives to force SIMD operations
OpenMP allows explicit vectorisation of functions called from vectorised loops
General programming tips
- Be aware of the Load-Hit-Store problem (it does exist on multiple levels)
- prevents caching by the compiler and causes pipeline stalls.
- e.g., appears in MPI-IO (sometimes referred to as Read-Modify-Write effect)
- IO can dominate
- consider MPI-IO or, better still, HDF5
Whilst this section includes good practice for software engineers in general, the following points are key when preparing for exascale systems, primarily because large popular codes outlive the programmers who, in turn, typically outlive the HPC platforms for which the software was written.
- follow a strict coding style guide
- use readable variables
- use internal documentation
- keep routines to less than one page
- copyright statements for every module/subroutine/function
- key for IP monitoring
[1] “D2.2 Report on Deployment of Deep Track Tools and Services to Improve Efficiency of Research and Facilitating Access to CoE Capabilities”, CompBioMed, December 2019, https://www.compbiomed.eu/wp-content/uploads/2019/02/D2.2-Report-on-Deployment-of-Deep-Track-Tools-and-Services-to-Improve-Efficiency-of-Research-and-Facilitating-Access-to-CoE-Capabilities_v1.1.pdf
[2] “D4.7 Final Report on Enhanced Services Progress”, EXCELLERAT, June 2022, https://www.excellerat.eu/wp-content/uploads/2022/06/EXCELLERAT_WP4_D4.7_Final_Report_on_Enhanced_Services_Progress_v1.0.pdf