#### Hot Chips 21

# POWER7: IBM's Next Generation, Balanced POWER Server Chip

William J. Starke, POWER7 Chief Storage Hierarchy and SMP Architect

Acknowledgment: This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002




#### **Challenge: Beating Physics to Realize Multi-core Potential**



POWER7<sup>™</sup> is an 8-core, high performance Server chip. A solid chip is a good start. But to win the race, you need a balanced system. POWER7 enables that balance.




#### **Trends in Server Evolution**

[Figure: server evolution over time, enabled by technology and innovation, driven by IT evolution and economics. Traditional entry server: single image platform, 2-core chips, 2 to 4 sockets, 4 to 8-way SMP. Traditional high-end server: virtualized consolidation platform, 2-core chips, 8 to 32 sockets, 16 to 64-way SMP. Emerging entry server: virtualized/cloud platform, 8-core chips, 2 to 4 sockets, 16 to 32-way SMP.]

- A simple matter of riding the multi-core trend?

- Add more cores to the die, beef up some interfaces, and scale to a large SMP?

> \* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.


Not so simple:

- Emerging entry servers have characteristics similar to traditional high-end large SMP servers.

Challenge: Achieving solid virtual machine performance requires a Balanced System Structure.




#### **Challenge: How does POWER7 maintain the Balance?**

Need to amplify effective socket throughput to close the gap and achieve potential.

[Figure: compute throughput potential versus socket throughput limitation (physical signal economics) over multi-core evolution.]

#### Cache Hierarchy Technology and Innovation

#### Cache Hierarchy Requirement for POWER<sup>®</sup> Servers



#### Challenge for Multi-core POWER7

POWER4<sup>™</sup>, POWER5<sup>™</sup>, and POWER6<sup>™</sup> systems derive huge benefit from high bandwidth access to large, off-chip cache.

But socket pin count constraints prevent scaling the off-chip cache interface to support 8 cores.

#### Solution: High speed eDRAM on the processor die



Industry Standard Caching and Memory Technologies: Conventional DIMMs, Dense and Fast SRAMs.

IBM's POWER Servers have leveraged large off-chip eDRAM caches in POWER4, 5, and 6.

With POWER7, IBM introduces on-processor, high-speed, custom eDRAM, combining the dense, low power attributes of eDRAM with the speed and bandwidth of SRAM.





#### Solution: Hybrid L3 "Fluid" Cache Structure








- Enables a subset of the cores to utilize the entire large shared L3 cache when the remaining cores are not using it.
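As a rough illustration of the bullet above, the sketch below shows how much L3 capacity each active core could draw on if the 32M shared L3 (the figure from the cache hierarchy summary) were divided evenly among only the cores that are running. The even split and the function name `l3_share_per_active_core` are assumptions made for illustration; the slides do not describe the actual allocation policy.

```python
TOTAL_L3_MB = 32   # shared L3 capacity quoted in the deck
CORES = 8          # cores per POWER7 chip


def l3_share_per_active_core(active_cores: int) -> float:
    """Hypothetical even split of the shared L3 among the active cores."""
    if not 1 <= active_cores <= CORES:
        raise ValueError("active_cores must be between 1 and 8")
    return TOTAL_L3_MB / active_cores


for n in (8, 4, 2, 1):
    print(f"{n} active cores -> {l3_share_per_active_core(n):.0f}M each")
```

With all eight cores active the even split matches the "up to 4M" fast region per core; with one core active, that core could in principle reach the full 32M.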





#### Solution: L2 "Turbo" Cache



- L2 "Turbo" cache keeps a tight 256K working set with extremely low latency (~3X lower than local L3 region) and high bandwidth, reducing L3 power and boosting performance.



#### **Cache Hierarchy Summary**



| Cache Level    | Capacity | Array     | Policy         | Comment                            |
|----------------|----------|-----------|----------------|------------------------------------|
| L1 Data        | 32K      | Fast SRAM | Store-thru     | Local thread storage update        |
| Private L2     | 256K     | Fast SRAM | Store-In       | De-coupled global storage update   |
| Fast L3 Region | Up to 4M | eDRAM     | Partial Victim | Reduced power footprint (up to 4M) |
| Shared L3      | 32M      | eDRAM     | Adaptive       | Large 32M shared footprint         |
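The capacities in the table can be cross-checked with a few lines of arithmetic. The sketch below encodes the table rows in a small data structure and confirms that eight fast L3 regions of up to 4M each are consistent with the 32M shared L3. The `CacheLevel` class and its field names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheLevel:
    name: str
    capacity_kb: int   # capacity in K, with 1M taken as 1024K
    array: str
    policy: str


# Rows copied from the summary table; "up to 4M" is taken at its maximum.
HIERARCHY = [
    CacheLevel("L1 Data", 32, "Fast SRAM", "Store-thru"),
    CacheLevel("Private L2", 256, "Fast SRAM", "Store-In"),
    CacheLevel("Fast L3 Region", 4 * 1024, "eDRAM", "Partial Victim"),
    CacheLevel("Shared L3", 32 * 1024, "eDRAM", "Adaptive"),
]

CORES = 8  # cores per POWER7 chip

fast_region, shared_l3 = HIERARCHY[2], HIERARCHY[3]
# Eight fast regions at their 4M maximum add up to the 32M shared L3.
assert CORES * fast_region.capacity_kb == shared_l3.capacity_kb

# Capacity growth from each level to the next.
for prev, nxt in zip(HIERARCHY, HIERARCHY[1:]):
    print(f"{nxt.name} is {nxt.capacity_kb // prev.capacity_kb}x {prev.name}")
```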




#### **Advances in Memory Subsystem**

#### Memory Subsystem Requirement for POWER Servers




#### Challenge for Multi-core POWER7

#### **Socket Challenge:**

4x growth in memory bandwidth and capacity needed per socket.

#### **System Challenge:**

Packaging more memory into similar volume with similar energy and cooling constraints.


#### **Multi-faceted Solution**









#### **Advances in Off-chip Signaling Technology**

- Enhanced single-ended "Elastic Interface" technology
- New high-speed, low-power differential technology

| Interface        | Signal Type  | Info Width | Frequency | Bandwidth |
|------------------|--------------|------------|-----------|-----------|
| Off-chip Cache   | none         | none       | none      | none      |
| Memory Channels  | Differential | 28 bytes   | 6.4 GHz   | 180 GB/s  |
| I/O Bridge       | Single-ended | 20 bytes   | 2.5 GHz   | 50 GB/s   |
| SMP Interconnect | Single-ended | 120 bytes  | 3.0 GHz   | 360 GB/s  |
| Total Bandwidth  |              |            |           | 590 GB/s  |

(Note that bandwidths shown are raw, peak signal bandwidths)
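The table's bandwidth column can be reproduced from the other two columns. The sketch below assumes that raw peak bandwidth is simply info width multiplied by frequency (one transfer of the full width per cycle), which is an interpretation of the table rather than something the slides state. The helper name `raw_bandwidth_gbs` is hypothetical.

```python
# (interface, info width in bytes, frequency in GHz, bandwidth quoted in the table in GB/s)
INTERFACES = [
    ("Memory Channels", 28, 6.4, 180),
    ("I/O Bridge", 20, 2.5, 50),
    ("SMP Interconnect", 120, 3.0, 360),
]


def raw_bandwidth_gbs(width_bytes: float, freq_ghz: float) -> float:
    """Assumed model: bytes per transfer times billions of transfers per second."""
    return width_bytes * freq_ghz


computed_total = 0.0
for name, width, freq, quoted in INTERFACES:
    bw = raw_bandwidth_gbs(width, freq)
    computed_total += bw
    print(f"{name}: {width} B x {freq} GHz = {bw:.1f} GB/s (table: {quoted} GB/s)")

quoted_total = sum(row[3] for row in INTERFACES)
print(f"Computed total: {computed_total:.1f} GB/s, quoted total: {quoted_total} GB/s")
```

Under this assumption the memory channel figure comes out at 179.2 GB/s, which the table rounds to 180 GB/s, and the quoted per-interface values sum to the 590 GB/s total.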

 Moving L3 onto POWER7 along with advances in signaling technology enables significant raw bandwidth growth for both memory and I/O subsystems. Note that advanced scheduling improves POWER7's ability to utilize memory bandwidth.




#### **Exploit Long Term Investment in Coherence Innovation**



Using local and remote SMP links, up to 32 POWER7 chips are connected.






Up to 32 POWER7 chips form a massive SMP system.


#### **Coherence Protocol Features**

- POWER storage Architecture enables decoupled global storage updates. Updates can be reordered and are effectively "deserialized".
- Decentralized coherence resolution, and bounded latency broadcast transport layer.

#### **POWER7 Exploitation**

 POWER Servers can drive massive coherence throughput. A 32-chip POWER7 system can manage over 20,000 concurrently reordered coherent storage operations (~4X more than POWER6 systems), with minimal tracking overhead per operation.

- Decentralized coherence resolution, advanced cache states, optimized on-chip transport, and broadcast free barriers.
- Low latency intervention, high performance locking constructs, and robust scaling.

Key Ingredients for Balanced Scaling in Traditional POWER Servers:

- Architecture enables re-ordered, decoupled storage updates
- Decentralized coherence resolution
- Broadcast transport layer


Challenge: As system size grows, Coherence broadcast traffic increases




Solution: Speculative limited scope Coherence broadcast

- In 2003, recognized emerging trend
- Developed Dual-Scope Broadcast Coherence Protocol for POWER6
- Utilizes 13 cache states and integrated scope indicator in memory
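The general idea of a speculative, limited-scope broadcast can be sketched with a toy model. This is not IBM's protocol: the 8 x 4 topology, the single scope bit per line, the retry rule, and every name below are assumptions made for illustration. A request is first broadcast only within its own node; if the scope indicator says a copy may exist elsewhere, the request is repeated at system scope.

```python
from dataclasses import dataclass, field


@dataclass
class ToySystem:
    nodes: int = 8            # hypothetical topology: 8 nodes
    chips_per_node: int = 4   # of 4 chips each, 32 chips in total
    # line address -> True if a copy may exist outside the home node
    scope_bit: dict = field(default_factory=dict)
    snoops: int = 0           # running count of chips snooped

    def request(self, line: int) -> str:
        """Try a node-scope broadcast first; widen to system scope if needed."""
        self.snoops += self.chips_per_node                 # speculative local broadcast
        if not self.scope_bit.get(line, False):
            return "resolved at node scope"
        self.snoops += self.nodes * self.chips_per_node    # retry across the whole system
        return "resolved at system scope"


system = ToySystem()
system.scope_bit[0x100] = True                # line 0x100 has been shared beyond its node
print(system.request(0x80), system.snoops)    # node scope, 4 chips snooped
print(system.request(0x100), system.snoops)   # system scope, 4 + 4 + 32 = 40 snooped
```

In this toy model, a request that stays at node scope touches 4 chips instead of 32, which is the kind of traffic reduction a limited-scope broadcast aims for; the cost is an extra local attempt whenever the speculation fails.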



#### **Provides value for POWER6**

- Latency reduction
- Near Perfect Scaling for extreme memory intensive workloads
- Ultra-dense packaging (Power 575)

#### **Necessity for POWER7**

- 450 GB/s must grow to 1.6 TB/s to match POWER6 scaling
- 450 GB/s -> 3.6 TB/s theoretical peak
- 3.6 TB/s -> 14.4 TB/s with chip scope
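The ratios between these figures are easy to work out. The sketch below treats 450 GB/s as the baseline coherence bandwidth, which is an assumed reading of the bullets, and computes the growth factors implied by the other three numbers. The constant names are hypothetical.

```python
BASELINE_GBS = 450        # starting figure in the bullets (assumed baseline)
REQUIRED_GBS = 1_600      # 1.6 TB/s needed to match POWER6 scaling
PEAK_GBS = 3_600          # 3.6 TB/s theoretical peak
CHIP_SCOPE_GBS = 14_400   # 14.4 TB/s with chip scope

required_factor = REQUIRED_GBS / BASELINE_GBS    # about 3.56x
peak_factor = PEAK_GBS / BASELINE_GBS            # 8x
chip_scope_factor = CHIP_SCOPE_GBS / PEAK_GBS    # 4x
headroom = PEAK_GBS / REQUIRED_GBS               # 2.25x

print(f"required / baseline = {required_factor:.2f}x")
print(f"peak / baseline     = {peak_factor:.0f}x")
print(f"chip scope / peak   = {chip_scope_factor:.0f}x")
print(f"peak / required     = {headroom:.2f}x")
```

Read this way, the theoretical peak is 8x the baseline and leaves 2.25x headroom over the 1.6 TB/s requirement, and chip scope multiplies the peak by a further 4x.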


#### **Conclusion: POWER7 maintains the Balance**

Achieves extreme multi-core throughput while providing the balance and SMP scaling IBM customers expect, by building on a foundation of solid innovation.

[Figure: the gap between socket throughput limitation (physical signal economics) and compute throughput potential, closed by:]

- Cache Hierarchy Technology and Innovation
- Advances in Memory Subsystem
- Advances in Off-Chip Signaling Technology
- Exploit Long Term Investment in Coherence Innovation

IBM POWER chips are uniquely positioned to excel given the emerging trends:

1. History of large SMP leadership
2. Storage Architecture economics
3. High density packaging leadership