



### IBM's Next Generation POWER Processor

Hot Chips August 18-20, 2019

Jeff Stuecheli, Scott Willenborg, William Starke

# Power Systems Proposed POWER Processor Technology and I/O Roadmap



### Focus of 2018 talk

| | POWER7 Architecture | | POWER8 Architecture | | POWER9 Architecture | | | POWER10 |
|---|---|---|---|---|---|---|---|---|
| | 2010<br>POWER7<br>8 cores<br>45nm<br>New Micro-Architecture | 2012<br>POWER7+<br>8 cores<br>32nm<br>Enhanced Micro-Architecture<br>New Process Technology | 2014<br>POWER8<br>12 cores<br>22nm<br>New Micro-Architecture | 2016<br>POWER8 w/ NVLink<br>12 cores<br>22nm<br>Enhanced Micro-Architecture<br>With NVLink | 2017<br>P9 SO<br>12/24 cores<br>14nm<br>New Micro-Architecture<br>Direct attach memory<br>New Process Technology | 2018<br>P9 SU<br>12/24 cores<br>14nm<br>Enhanced Micro-Architecture<br>Buffered Memory | 2020<br>P9 AIO<br>12/24 cores<br>14nm<br>Enhanced Micro-Architecture<br>New Memory Subsystem | 2021<br>P10<br>TBA cores<br>New Micro-Architecture |
| Sustained Memory Bandwidth | Up To<br>65 GB/s | Up To<br>65 GB/s | Up To<br>210 GB/s | Up To<br>210 GB/s | Up To<br>150 GB/s | Up To<br>210 GB/s | Up To<br>650 GB/s | Up To<br>800 GB/s |
| Standard I/O Interconnect | PCIe Gen2 | PCIe Gen2 | PCIe Gen3 | PCIe Gen3 | PCIe Gen4 x48 | PCIe Gen4 x48 | PCIe Gen4 x48 | PCIe Gen5 |
| Advanced I/O Signaling | N/A | N/A | N/A | 20 GT/s<br>160 GB/s | 25 GT/s<br>300 GB/s | 25 GT/s<br>300 GB/s | 25 GT/s<br>300 GB/s | 32 & 50 GT/s |
| Advanced I/O Architecture | N/A | N/A | CAPI 1.0 | CAPI 1.0,<br>NVLink | CAPI 2.0,<br>OpenCAPI 3.0,<br>NVLink | CAPI 2.0,<br>OpenCAPI 3.0,<br>NVLink | CAPI 2.0,<br>OpenCAPI 4.0,<br>NVLink | TBA |

#### Statement of Direction, Subject to Change

### Power Systems Proposed POWER Processor Technology and I/O Roadmap



### Focus of today's talk

| | POWER7 Architecture | | POWER8 Architecture | | POWER9 Architecture | | | POWER10 |
|---|---|---|---|---|---|---|---|---|
| | 2010<br>POWER7<br>8 cores<br>45nm<br>New Micro-Architecture<br>New Process Technology | 2012<br>POWER7+<br>8 cores<br>32nm<br>Enhanced Micro-Architecture<br>New Process Technology | 2014<br>POWER8<br>12 cores<br>22nm<br>New Micro-Architecture<br>New Process Technology | 2016<br>POWER8 w/ NVLink<br>12 cores<br>22nm<br>Enhanced Micro-Architecture<br>With NVLink | 2017<br>P9 SO<br>12/24 cores<br>14nm<br>New Micro-Architecture<br>Direct attach memory<br>New Process Technology | 2018<br>P9 SU<br>12/24 cores<br>14nm<br>Enhanced Micro-Architecture<br>Buffered Memory | 2020<br>P9 AIO<br>12/24 cores<br>14nm<br>Enhanced Micro-Architecture<br>New Memory Subsystem | 2021<br>P10<br>TBA cores<br>New Micro-Architecture<br>New Process Technology |
| Sustained Memory Bandwidth | Up To<br>65 GB/s | Up To<br>65 GB/s | Up To<br>210 GB/s | Up To<br>210 GB/s | Up To<br>150 GB/s | Up To<br>210 GB/s | Up To<br>650 GB/s | Up To<br>800 GB/s |
| Standard I/O Interconnect | PCIe Gen2 | PCIe Gen2 | PCIe Gen3 | PCIe Gen3 | PCIe Gen4 x48 | PCIe Gen4 x48 | PCIe Gen4 x48 | PCIe Gen5 |
| Advanced I/O Signaling | N/A | N/A | N/A | 20 GT/s<br>160 GB/s | 25 GT/s<br>300 GB/s | 25 GT/s<br>300 GB/s | 25 GT/s<br>300 GB/s | 32 & 50 GT/s |
| Advanced I/O Architecture | N/A | N/A | CAPI 1.0 | CAPI 1.0,<br>NVLink | CAPI 2.0,<br>OpenCAPI 3.0,<br>NVLink | CAPI 2.0,<br>OpenCAPI 3.0,<br>NVLink | CAPI 2.0,<br>OpenCAPI 4.0,<br>NVLink | TBA |


### Power Systems Proposed POWER Processor Technology and I/O Roadmap



#### Looking forward

- POWER10 (P10), 2021, TBA cores
- New Micro-Architecture, New Process Technology
- Sustained Memory Bandwidth: Up To 800 GB/s
- Standard I/O Interconnect: PCIe Gen5
- Advanced I/O Signaling: 32 & 50 GT/s
- Advanced I/O Architecture: TBA








**Power Systems** 

- Extreme Processor / Accelerator Bandwidth and Reduced Latency
- Coherent Memory and Virtual Addressing Capability for all Accelerators
- OpenPOWER Community Enablement Robust Accelerated Compute Options
- State of the Art I/O and Acceleration Attachment Signaling
  - PCIe Gen 4 x 48 lanes: 192 GB/s duplex bandwidth
  - 25 G Common Link x 96 lanes: 600 GB/s duplex bandwidth
- Robust Accelerated Compute Options with OPEN standards
  - On-Chip Acceleration Gzip x1, 842 Compression x2, AES/SHA x2
  - CAPI 2.0 4x bandwidth of POWER8 using PCIe Gen 4
  - **NVLink** Next generation of GPU/CPU bandwidth
  - **OpenCAPI** High bandwidth, low latency and open interface
  - **OMI** High bandwidth and/or differentiated for acceleration
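
The two duplex bandwidth figures in the list above can be reproduced from lane count and signaling rate. The sketch below is only a back-of-the-envelope check: it assumes one bit per transfer per lane, PCIe Gen 4 at 16 GT/s, and no encoding or protocol overhead. The function name `raw_duplex_gb_per_s` is made up for illustration.

```python
def raw_duplex_gb_per_s(lanes: int, gt_per_s: float) -> float:
    """Raw duplex bandwidth in GB/s.

    Assumes 1 bit per transfer per lane and ignores encoding and
    protocol overhead (a simplification, not a measured figure).
    """
    per_direction = lanes * gt_per_s / 8  # Gb/s -> GB/s
    return 2 * per_direction              # both directions at once


# PCIe Gen 4: 48 lanes at 16 GT/s
pcie_gen4 = raw_duplex_gb_per_s(lanes=48, gt_per_s=16)

# 25 G common link: 96 lanes at 25 GT/s
common_link = raw_duplex_gb_per_s(lanes=96, gt_per_s=25)

print(f"PCIe Gen 4 x48: {pcie_gen4:.0f} GB/s duplex")
print(f"25 G x96:       {common_link:.0f} GB/s duplex")
```

Under these assumptions the results match the 192 GB/s and 600 GB/s quoted on the slide.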





**POWER9** 

**PowerAccel** 





# THE WORLD'S TWO MOST POWERFUL SUPERCOMPUTERS

BUILT FOR THE AI ERA WITH OPEN COLLABORATION









- Designed to support a range of devices
  - Coherent Caching Accelerators
  - Network Controllers
  - Differentiated Memory
    - High Bandwidth
    - Low Latency
    - Storage Class Memory
  - Storage Controllers



- Asymmetric design, endpoint optimized for host and device attach
  - **ISA of Host Architecture:** Need to hide differences in coherence, memory model, address translation, etc.
  - **Design schedule:** The design schedule of a high-performance CPU host is typically on the order of multiple years; accelerator devices have much shorter development cycles, typically less than a year.
  - **Timing Corner:** ASIC and FPGA technologies run at lower frequencies and with less timing optimization than CPUs.
  - **Plurality of devices:** Effort in the host, in both IP and circuit resources, has a multiplicative effect.
  - **Trust:** Attached devices are susceptible to both intentional and unintentional trust violations.
  - **Cache coherence:** Hosts have high variability in protocol. The host cannot trust an attached device to obey the rules.





- Low Latency and High Bandwidth
  - Fixed width DL CRC
  - Aligned TL
  - Aligned Data
  - Separately pipelined control/tag vs data
    - Compromise in switching capability

| Bytes(63:0) | |
|---|---|
| DL content | TL command / response content |
| | Data flit 0 |
| | Data flit 1 |
| | Data flit 2 |
| | Data flit 3 |
| | Data flit 4 |
| | Data flit 5 |
| | Data flit 6 |
| | Data flit 7 |
| DL content | TL command / response content |
| | Data flit 0 |
| | Data flit 1 |
| DL content | TL command / response content |
| DL content | TL command / response content |
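
The flit sequence in the diagram can be modeled as control flits, each followed by zero or more aligned 64-byte data flits. The sketch below only reproduces the ordering shown in the diagram. It is not taken from the OpenCAPI DL/TL specification, and the names `FLIT_BYTES` and `frame_sequence` are hypothetical.

```python
FLIT_BYTES = 64  # each row of the diagram spans Bytes(63:0)


def frame_sequence(data_flits_per_control):
    """Return the row labels for a stream of flits.

    Each entry of data_flits_per_control is the number of data flits
    that follow one control flit (DL content plus TL command/response
    content). This mirrors the diagram only; field encodings are not
    modeled.
    """
    rows = []
    for count in data_flits_per_control:
        rows.append("control")
        rows.extend(f"data {i}" for i in range(count))
    return rows


# The diagram shows runs of 8, 2, 0 and 0 data flits.
rows = frame_sequence([8, 2, 0, 0])
total_bytes = len(rows) * FLIT_BYTES

for row in rows:
    print(row)
print(f"{len(rows)} flits, {total_bytes} bytes")
```

Because every flit has the same width and alignment, a receiver in this model can locate control content at a fixed position without parsing the data flits.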



### **POWER9 Family Memory Architecture**







### **Primary Tier Memory Options**





© 2019 IBM Corporation





#### **Processor Chip Details**

- 728 mm<sup>2</sup> (25.3 x 28.8 mm)
- 8 Billion Transistors
- Up to 24 SMT4 Cores
- Up to 120 MB eDRAM L3 cache

#### **Semiconductor Technology**

- 14nm finFET
- Improved device performance
- Reduced energy
- eDRAM
- 17 layer metal stack

#### **High Bandwidth Signaling**

- 25 GT/s low energy differential
  - PowerAXON, OMI memory
- 16 GT/s low energy differential
  - Local SMP
- 16 GT/s PCIe Gen4

### The Bandwidth Beast Advanced I/O (AIO)



### 2 TB/s Raw Signaling Bandwidth Shared by 6 Attach Protocols

#### **Open Memory Interface (OMI)**

- 16 channels x8 at 25 GT/s
- 650 GB/s peak 1:1 r/w bandwidth
- Technology Agnostic
- Offered w/ Microchip DDR4 buffer (410 GB/s peak bandwidth)
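
One way to read the 410 GB/s figure for the DDR4 buffer is as the combined peak of 16 DDR4 ports, one per OMI channel. The sketch below assumes DDR4-3200 with a 64-bit data bus behind each buffer. Both values are assumptions for illustration, not stated on this slide.

```python
OMI_CHANNELS = 16        # from the slide: 16 OMI channels
DDR4_MT_PER_S = 3200     # assumed DDR4-3200 (mega-transfers per second)
DDR4_BUS_BYTES = 8       # assumed 64-bit data bus per DDR4 port

# Peak bandwidth of one DDR4 port in GB/s.
per_port_gb_s = DDR4_MT_PER_S * DDR4_BUS_BYTES / 1000

# Peak bandwidth with one such port behind each OMI channel.
total_gb_s = OMI_CHANNELS * per_port_gb_s

print(f"per port: {per_port_gb_s:.1f} GB/s")
print(f"16 ports: {total_gb_s:.1f} GB/s")
```

Under these assumptions the total comes to about 410 GB/s, which would make the DRAM side, rather than the OMI link, the limit for the buffered DDR4 configuration.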

#### **PowerAXON 25 GT/s Attach**

- Up to 16 socket glue-less SMP (4x24 SMP added to 3x30 local)
- Up to x48 NVIDIA NVLINK GPU attach
- Up to x48 OpenCAPI 4.0 coherent accelerator / memory attach

#### Industry Standard I/O Attach

- x48 PCIe Gen 4 at 16 GT/s
- Up to x16 CAPI 2.0 coherent accelerator / storage attach
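
The 2 TB/s headline can be approximated by summing raw duplex bandwidth over the link groups listed above. This is a sketch under stated assumptions: "4x24" and "3x30" are read as link count times lane width, local SMP is taken to run at 16 GT/s as in the signaling list, and encoding overhead is ignored. The names `LINKS` and `duplex_gb_s` are illustrative.

```python
def duplex_gb_s(lanes: int, gt_per_s: float) -> float:
    """Raw duplex GB/s, assuming 1 bit per transfer per lane."""
    return 2 * lanes * gt_per_s / 8


# (lane count, signaling rate in GT/s); lane counts are an interpretation
# of the slide text, not confirmed figures.
LINKS = {
    "OMI":       (16 * 8, 25.0),  # 16 channels x8
    "PowerAXON": (4 * 24, 25.0),  # "4x24" read as 4 links of 24 lanes
    "Local SMP": (3 * 30, 16.0),  # "3x30" read as 3 links of 30 lanes
    "PCIe Gen4": (48, 16.0),      # x48
}

per_link = {name: duplex_gb_s(*spec) for name, spec in LINKS.items()}
total_tb_s = sum(per_link.values()) / 1000

for name, gb_s in per_link.items():
    print(f"{name:10s} {gb_s:6.0f} GB/s")
print(f"total      {total_tb_s:.2f} TB/s")
```

With these inputs the sum lands just under 2 TB/s, consistent with the rounded headline number.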







### **Roadmap of Capabilities and Host Silicon Delivery**

| Accelerator Protocol | CAPI 1.0 | CAPI 2.0 | OpenCAPI 3.0 | OpenCAPI 4.0 | OpenCAPI 5.0 |
|---|---|---|---|---|---|
| First Host Silicon | POWER8<br>(GA 2014) | POWER9 SO<br>(GA 2017) | POWER9 SO<br>(GA 2017) | POWER9 AIO<br>(GA 2020) | POWER10<br>(GA 2021) |
| Functional Partitioning | Asymmetric | Asymmetric | Asymmetric | Asymmetric | Asymmetric |
| Host Architecture | POWER | POWER | Any | Any | Any |
| Cache Line Size Supported | 128B | 128B | 64/128/256B | 64/128/256B | 64/128/256B |
| Attach Vehicle | PCIe Gen 3<br>Tunneled | PCIe Gen 4<br>Tunneled | 25 G (open)<br>Native DL/TL | 25 G (open)<br>Native DL/TL | 32/50 G (open)<br>Native DL/TL |
| Address Translation | On Accelerator | Host | Host (secure) | Host (secure) | Host (secure) |
| Native DMA to Host Mem | No | Yes | Yes | Yes | Yes |
| Atomics to Host Mem | No | Yes | Yes | Yes | Yes |
| Host Thread Wake-up | No | Yes | Yes | Yes | Yes |
| Host Memory Attach Agent | No | No | Yes | Yes | Yes |
| Low Latency Short Msg | 4B/8B MMIO | 4B/8B MMIO | 4B/8B MMIO | 128B push | 128B push |
| Posted Writes to Host Mem | No | No | No | Yes | Yes |
| Caching of Host Mem | RA Cache | RA Cache | No | VA Cache | VA Cache |
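
The yes/no rows of the table lend themselves to a small lookup, for example to find the first protocol generation that offers a given capability. The sketch below transcribes only those rows. The names `PROTOCOLS`, `CAPABILITIES`, and `first_protocol_with` are illustrative and not part of any CAPI or OpenCAPI tooling.

```python
PROTOCOLS = ["CAPI 1.0", "CAPI 2.0", "OpenCAPI 3.0", "OpenCAPI 4.0", "OpenCAPI 5.0"]

# One boolean per protocol, in the same order as PROTOCOLS,
# transcribed from the Yes/No rows of the table.
CAPABILITIES = {
    "Native DMA to Host Mem":    [False, True, True, True, True],
    "Atomics to Host Mem":       [False, True, True, True, True],
    "Host Thread Wake-up":       [False, True, True, True, True],
    "Host Memory Attach Agent":  [False, False, True, True, True],
    "Posted Writes to Host Mem": [False, False, False, True, True],
}


def first_protocol_with(capability: str):
    """Return the earliest protocol that supports the capability, or None."""
    for protocol, supported in zip(PROTOCOLS, CAPABILITIES[capability]):
        if supported:
            return protocol
    return None


for name in CAPABILITIES:
    print(f"{name}: first available in {first_protocol_with(name)}")
```

Read this way, the table shows capabilities accumulating from one generation to the next: none of the listed features is dropped once it has been introduced.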



### **POWER9 Connectivity Variants**









### IBM Centaur DIMM





OMI DDIMM

- Technology agnostic
- Low cost
- Ultra-scale system density
- Enterprise reliability
- Low-latency
- High bandwidth





- Signaling: 25.6GHz vs DDR4 @ 3200 MHz
  - 4x raw bandwidth per I/O signal
  - 1.3x mixed traffic utilization
- Idle load-to-use latency over traditional DDR:
  - POWER8/9 Centaur design ~10 ns
  - OMI target of ~5-10 ns (RDIMM)
  - OMI target of < 5 ns (LRDIMM)
- IBM Centaur: One proprietary DMI design
- Microchip SMC 1000:
  - Open (OMI) design
  - Emerging JEDEC Standard



### 8x25G Open Memory Interface (OMI) Serial DDR4 Smart Memory Controller





### **OMI Memory Latency**





Microchip's SMC 1000 8x25G features an innovative low latency design that delivers less than four ns incremental latency over a traditional integrated DDR controller with LRDIMM. This results in OMI-based DDIMM products having virtually identical bandwidth and latency performance to comparable LRDIMM products.

A 6x advantage in bandwidth per unit of PHY area gives POWER9 AIO the bandwidth of 16 DDR ports.






### **PowerAXON**

### The Bandwidth Beast POWER9 with Advanced I/O (AIO)









**Thank You!** 

## **OMI Memory**



