

SiPearl Outlook

Teratec

Jean-Marc Denis

Chief Strategy Officer jean-marc.denis@sipearl.com





### SiPearl corporate overview

### The European Server Processor Solution

HQ: Maisons-Laffitte (Paris), France

**Incorporated in June 2019** 

**CEO and Founder, Philippe Notton** 

#### **Design centers:**

- France: Maisons-Laffitte, Massy Palaiseau, Sophia Antipolis, Grenoble
- Germany: Duisburg (Düsseldorf)
- Spain: Barcelona

Key personnel from Intel, Atos, ST, Marvell, Nokia, MStar-MediaTek

HPC Targeted Architecture based on Arm Neoverse V1 cores

100+ employees today, targeting more than 1,000 in 2025





### SiPearl offices

We are close to our partners and customers



# SiPearl extensions


### SIPEARL CORPORATE VISION AND STRATEGY



#### Our business model is sustainable over time

### OVERALL ROADMAP



# EPI COMMON PLATFORM ENABLES EU ECOSYSTEM

- SiPearl is also chartered to develop the European ecosystem
- SiPearl shares IP and benefits from IP ecosystem
  - Accelerator development (RISC-V based)
    - AI (tensor)
    - Vector processing
    - Stencil processing
    - FPGA
    - ...
  - Packaging
  - IP development
- Staged integration: start with socket-to-socket connections and move into package (multi-chiplets) over time



# Rhea: a Processor for the Exa Era



### At the heart of Rhea

### With its high-performance, low-power Arm Neoverse V1 architecture, Rhea will meet the needs of all supercomputing workloads.

#### Key features

| Block                  | Features |
|------------------------|----------|
| Core                   | <ul> <li>Arm architecture</li> <li>Neoverse V1 cores</li> <li>256-bit SVE per core supporting FP64, FP32, BF16 and INT8</li> <li>Arm virtualization extensions</li> </ul> |
| SoC                    | <ul> <li>Arm mesh fabric</li> <li>Advanced RAS support including Arm RAS extensions</li> <li>Link protection for NoC &amp; high-speed I/O</li> <li>ECC support for selected memory</li> </ul> |
| Cache                  | <ul> <li>Large L3 (Shared Level Cache)</li> <li>RAS supported for all cache levels</li> </ul> |
| Memory                 | <ul> <li>HBM2e and DDR5</li> <li>ECC for memory and link protection for controllers</li> </ul> |
| High Speed I/O         | <ul> <li>PCIe, CCIX &amp; CXL</li> <li>Root and endpoint support</li> </ul> |
| Other I/O              | <ul> <li>USB, GPIO, SPI, I<sup>2</sup>C</li> </ul> |
| Power Management       | <ul> <li>Power management block to optimize perf/watt across use cases and workloads</li> </ul> |
| Security Block Support | <ul> <li>Secure boot and secure upgrade</li> <li>Crypto</li> <li>True random number generation</li> <li>Made in Europe</li> </ul> |

Rhea will deliver extraordinary real compute performance and efficiency with an unmatched Bytes/Flops ratio.



SIPEARL



### PROCESSOR CORES INSIDE RHEA

| Cores                      | Count (total: x V1 + 2 M7 Arm, 29 RISC-V) | ISA    | Remarks                             |
|----------------------------|-------------------------------------------|--------|-------------------------------------|
| Arm Neoverse V1 cores      | Arm Neoverse V1                           | Arm    | Including spare V1s.                |
| Arm Cortex-M7 cores        | 2x Arm Cortex-M7 = 2.                     | Arm    | For SCP and MCP subsystems.         |
| RISC-V in PMS              | 1x Ariane + 1x ZeroRiscy = 2.             | RISC-V |                                     |
| RISC-V in SEG              | 1x Ariane = 1.                            | RISC-V | SEG for security element.           |
| RISC-V in STXs of 2x ERACs | 2x (1x Ariane + 8x Snitch) = 18.          | RISC-V |                                     |
| RISC-V in VRPs of 2x ERACs | 2x (4x VRP core) = 8.                     | RISC-V | VRP core is a modified RISC-V core. |
| SPUs in STXs of 2x ERACs   | 2x (2x SPU cores) = 4.                    | RISC-V | SPU core is a proprietary core.     |

- Some additional EU designed IP (power management, clock, cryptography) not counted here
- Not including µC cores used in Synopsys DDR controllers for the PHY training.

| Core   | Per-core performance                          |
|--------|-----------------------------------------------|
| V1     | 2x 256-bit SVE = 16 DP FLOPs/cycle; 2.5GHz@N6 |
| Snitch | 1x 64b FPU = 2 DP FLOPs/cycle; >1GHz@N6       |
| SPU    | 4x 32b FPU = 8 SP FLOPs/cycle; >1GHz@N6       |
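
The per-cycle figures in the table can be reproduced from the datapath widths. This is a sketch under one assumption that the slide does not state explicitly: each FPU lane retires one fused multiply-add (FMA) per cycle, and an FMA is counted as two FLOPs.

$$
\begin{aligned}
\text{V1:}\quad & 2\ \text{pipes} \times \frac{256\ \text{bit}}{64\ \text{bit}} \times 2\ \frac{\text{FLOP}}{\text{FMA}} = 16\ \text{DP FLOPs/cycle} \\
\text{Snitch:}\quad & 1\ \text{FPU} \times \frac{64\ \text{bit}}{64\ \text{bit}} \times 2\ \frac{\text{FLOP}}{\text{FMA}} = 2\ \text{DP FLOPs/cycle} \\
\text{SPU:}\quad & 4\ \text{FPUs} \times \frac{32\ \text{bit}}{32\ \text{bit}} \times 2\ \frac{\text{FLOP}}{\text{FMA}} = 8\ \text{SP FLOPs/cycle}
\end{aligned}
$$

At the quoted 2.5 GHz, the V1 figure would correspond to a theoretical peak of $16 \times 2.5 = 40$ DP GFLOP/s per core. Sustained performance depends on the workload.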

### STENCIL: DESIGNED FOR CLASSICAL HPC MODELS

1. Designed in the first place for finite-difference and finite-element algorithms (examples: CFD, FDTD, O&G)
2. Expanded to support a wider class of algorithms while retaining efficiency
3. Ease of programmability as a design goal
   - Accelerator for physicists rather than computer scientists
   - Also applicable to other domains, e.g. weather forecasting, CFD and energy

<pre>
#pragma omp target
{
    #pragma stx loop
    for (int z = stencil_radius; z &lt; dim_z - stencil_radius; z++)
    {
        #pragma stx loop(interleave)
        for (int y = stencil_radius; y &lt; dim_y - stencil_radius; y++)
        {
            #pragma stx loop
            for (int x = stencil_radius; x &lt; dim_x - stencil_radius; x++)
            {
                float dxyz = 0.0f;
                for (int r = 1; r &lt;= stencil_radius; r++)
                {
                    float const weight = parameter (r - 1);
                    dxyz += ( device_pressure_wavefield[z + r][y    ][x    ]
                    ...
                    dxyz += ( device_pressure_wavefield[z    ][y    ][x + r]
                            + device_pressure_wavefield[z    ][y    ][x - r]
                    ...
</pre>

Shown: ISO stencil example source code from our "spu-runtime" repository (Fraunhofer).

The user can write natural, complex expressions as indices for data accesses; the "structured data" configuration is made automatically. Supported index forms:

- data[x] (HW-loop variable)
- data[r] (non-HW loop variable)
- data[x +/- r]
- data[x][y] or data[y][x] (arbitrary order)
- data[x + (2 * r / 3)][y - ((r + 4) / 2)]
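
For readers without the STX toolchain, the same isotropic access pattern can be sketched in plain C. This is a hypothetical reference kernel, not code from the "spu-runtime" repository. The names `iso_stencil_ref` and `idx3`, the flattened array layout, the `weights` array, and the choice to multiply each symmetric neighbour sum by its weight are assumptions made for illustration. The `stx` pragmas are omitted.

```c
#include <stddef.h>

/* Flattened index into a [dim_z][dim_y][dim_x] array (x varies fastest). */
static size_t idx3(int z, int y, int x, int dim_y, int dim_x)
{
    return ((size_t)z * (size_t)dim_y + (size_t)y) * (size_t)dim_x + (size_t)x;
}

/*
 * Hypothetical reference kernel: for every interior point, accumulate the
 * weighted sum of the six neighbours at distance r along z, y and x,
 * for r = 1..radius. Border points of `out` are left untouched.
 */
void iso_stencil_ref(const float *in, float *out, const float *weights,
                     int radius, int dim_z, int dim_y, int dim_x)
{
    for (int z = radius; z < dim_z - radius; z++) {
        for (int y = radius; y < dim_y - radius; y++) {
            for (int x = radius; x < dim_x - radius; x++) {
                float dxyz = 0.0f;
                for (int r = 1; r <= radius; r++) {
                    const float w = weights[r - 1]; /* assumed per-radius weight */
                    dxyz += w * (in[idx3(z + r, y, x, dim_y, dim_x)]
                               + in[idx3(z - r, y, x, dim_y, dim_x)]
                               + in[idx3(z, y + r, x, dim_y, dim_x)]
                               + in[idx3(z, y - r, x, dim_y, dim_x)]
                               + in[idx3(z, y, x + r, dim_y, dim_x)]
                               + in[idx3(z, y, x - r, dim_y, dim_x)]);
                }
                out[idx3(z, y, x, dim_y, dim_x)] = dxyz;
            }
        }
    }
}
```

The loop nest mirrors the z, y, x, r ordering of the slide, so it can serve as a CPU baseline when checking accelerator output on small grids.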



#### Use case: RTM TTI

#### Structure

- Intersecting planes of mixed derivatives around a center point
- P- and Q-wavefields
- Velocity wavefield
- 4 additional physical parameter wave fields
- High arithmetic intensity: 5 - 18 flops/byte
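
The arithmetic-intensity range can be read through a simple roofline bound. The symbols $P_{\text{peak}}$ (peak compute rate of the accelerator) and $B$ (sustained memory bandwidth) are placeholders introduced here, not values taken from the slides:

$$
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B\right), \qquad 5 \le I \le 18\ \text{flops/byte}
$$

Under this model the kernel is compute-bound whenever $I \ge P_{\text{peak}} / B$, which is easier to satisfy at the upper end of the quoted range.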



#### Supporting

- RTM TTI 1-pass
- RTM TTI 2-pass
- Micro application kernel incorporating forward and backward propagation
- Kernel optimizations included (common subexpression elimination, precomputation, plane scheme, etc.)




# Variable Precision Processor (VRP)

### Definition

- The Variable Precision Processor (VRP) is a domain-specific accelerator for scientific computing, specially tailored for the accurate computation (up to 512-bit fractional parts) of large systems of equations.
- It supports the IEEE 754 extendable format in memory, with a byte-aligned data format to optimize memory usage and computing efficiency.

### **Motivation**

- Reduce the iteration count of conjugate gradient (CG) and bi-CG
- Simplify preconditioning
- Allow direct solvers instead of indirect ones (matrices with a bad condition number)
- Generally valid for many other algorithms, in particular Krylov-based projective resolution
- More investigation on Lanczos-based eigenvalue and singular-value solvers
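
The sensitivity of solver kernels to accumulator precision can be illustrated with a dot product, the inner operation of CG. The sketch below is not VRP code and does not use the VRP number format; it only contrasts a 32-bit and a 64-bit accumulator in standard C. The function names `dot_acc_f32` and `dot_acc_f64` are hypothetical.

```c
#include <stddef.h>

/* Dot product accumulated in single precision. */
float dot_acc_f32(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Same inputs, but products and the running sum are kept in double. */
double dot_acc_f64(const float *x, const float *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)x[i] * (double)y[i];
    return acc;
}
```

With x = {1e8f, 1.0f, -1e8f} and y = {1.0f, 1.0f, 1.0f} on an IEEE 754 platform, the single-precision accumulator absorbs the 1.0 term into 1e8 and the sum cancels to zero, while the double accumulator keeps it. Errors of this kind perturb the residuals an iterative solver relies on, which is the effect wider formats aim to reduce.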

### Performance

- It targets 10x to 100x acceleration of variable-precision computation (compared to software solutions).






### Rhea1: memory configurations

### Single socket



Two different / independent memory spaces

### Dual socket



Two different / independent memory spaces

- 1 unified (CC-NUMA) HBM space
- 1 unified (CC-NUMA) DDR space



### Atos Reference Board with Rhea1



- CCIX (16x) → 3
- GPU (16x) → 4
- NIC (16x) → 2

| Interface   | Width | Count | R1a: CCIX 16x to R1b | R1a: GPU 16x | R1a: NIC 16x | R1b: CCIX 16x to R1a | R1b: GPU 16x | R1b: NIC 16x |
|-------------|-------|-------|----------------------|--------------|--------------|----------------------|--------------|--------------|
| PCIe / CCIX | 16x   | 4     | 3                    |              | 1            | 3                    |              | 1            |
| PCIe        | 16x   | 2     |                      | 2            |              |                      | 2            |              |
| PCIe        | 4x    | 2     |                      |              |              |                      |              |              |




# Performance








#### INSTRUCTION LATENCY (SMALLER IS BETTER)

Legend: best in class; sort-of ex aequo.

| VFP64, full width         | Fujitsu A64FX | Arm Neoverse V1 | Intel Broadwell | Intel Skylake-X | AMD Rome / Zen 2 | AMD Milan / Zen 3 |
|---------------------------|---------------|-----------------|-----------------|-----------------|------------------|-------------------|
| **Latency (cycles)**      |               |                 |                 |                 |                  |                   |
| Add                       | 9             | 2               | 3               | 4               | 3                | 3                 |
| Mul                       | 9             | 3               | 3               | 4               | 3                | 3                 |
| FMA                       | 9             | 4 (2 if chained) | 5              | 4               | 5                | 4                 |
| Div                       | 43            | 7 to 15         | 19-23           | 24              | 13               | 13.5              |
| Sqrt                      | 43            | 7 to 16         | 27-29           | 28-29           | 20               | 20                |
| **Throughput (per cycle)** |              |                 |                 |                 |                  |                   |
| Add                       | 2             | 2               | 1               | 2               | 2                | 2                 |
| Mul                       | 2             | 2               | 2               | 2               | 2                | 2                 |
| FMA                       | 2             | 2               | 2               | 2               | 2                | 2                 |
| Div                       | 1/43          | 1/14 to 1/7     | 1/16            | 1/16            | 1/5              | 1/(4.5)           |
| Sqrt                      | 1/43          | 1/14 to 1/7     | 1/28 to 1/16    | 1/24 to 1/18    | 1/9              | 1/9               |
| Max SIMD                  | SVE [512]     | SVE [256]       | AVX2 [256]      | AVX-512         | AVX2 [256]       | AVX2 [256]        |
| Notes                     | NEON, scalar have the same throughput | NEON, scalar have twice the throughput (4x128 instead of 2x256) | | XCC-based cores; SSE, AVX, scalar have the same throughput | SSE, AVX, scalar have the same throughput | SSE, AVX, scalar have the same throughput |
| Source                    |               | https://developer.arm.com/do | | https://www.agner.org/optimize/instruction_tables.pdf | https://www.agner.org/optimize/instruction_tables.pdf | https://www.agner.org/optimize/instruction_tables.pdf |
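
One way to read the latency and throughput rows together is a Little's-law estimate of how many independent operations a loop must keep in flight to saturate the floating-point pipelines. This reading is an interpretation added here, not a claim from the slide. It assumes latency is given in cycles, throughput in instructions per cycle, and the unchained FMA latency for Neoverse V1:

$$
N_{\text{in flight}} = \text{latency} \times \text{throughput}
$$

$$
\begin{aligned}
\text{A64FX FMA:}\quad & 9 \times 2 = 18 \\
\text{Neoverse V1 FMA:}\quad & 4 \times 2 = 8 \\
\text{Broadwell, Rome FMA:}\quad & 5 \times 2 = 10 \\
\text{Skylake-X, Milan FMA:}\quad & 4 \times 2 = 8
\end{aligned}
$$

A lower value means less unrolling or fewer independent accumulators are needed before the FMA units are fully used.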

#### About SiPearl

Created by Philippe Notton, SiPearl is designing the high-performance, low-power microprocessor for European exascale supercomputers. This new generation of microprocessors will enable Europe to set out its technological sovereignty in strategic high performance computing markets such as artificial intelligence, medical research or climate modelling.

SiPearl is working in close collaboration with its 27 partners from the European Processor Initiative (EPI) consortium - leading names from the scientific community, supercomputing centres and industry - which are its stakeholders, future clients and end-users.

SiPearl employs 109<sup>\*</sup> people in France, Germany and Spain. Its first range of microprocessors, Rhea, will be launched at the end of the year.

The company is supported by the European Union (funding from the European Union's Horizon 2020 research and innovation program under specific grant agreement no. 826647).

\* as of June 15th 2022

Contact: Jean-Marc Denis, Chief Strategy Officer, jean-marc.denis@sipearl.com

