

# Centip3De: A 64-Core, 3D Stacked, Near-Threshold System

Ronald G. Dreslinski

David Fick, Bharan Giridhar, Gyouho Kim, Sangwon Seo, Matthew Fojtik, Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim, Nurrachman Liu, Michael Wieckowski, Gregory Chen, Trevor Mudge, Dennis Sylvester, David Blaauw

University of Michigan



### **The Problem of Power**



#### **Today: Super-V<sub>th</sub>, High Performance, Power Constrained**





Normalized CPU Metrics

Large gate overdrive favors performance with unsustainable power density

Must design within fixed TDP

Goal: maintain performance while improving energy per operation

### **Subthreshold Design**





Operating in sub-threshold yields large power gains at the expense of performance.

Applications: sensors, medical

Phoenix 2 Processor, ISSCC'10

# **Near-Threshold Computing (NTC)**



### **Architectural Impact of NTC**



- Caches have a higher V<sub>opt</sub> and operating frequency
- Lower activity rate than core logic
- Leakage is a larger proportion of total power in caches
- New architectures possible
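A toy energy model can show why a lower activity rate and proportionally larger leakage move the minimum-energy voltage upward. This is a rough sketch, not the authors' model: the threshold voltage, subthreshold slope, leakage weight, and both activity factors are illustrative assumptions, and `on_current` is a generic smooth interpolation between subthreshold and super-threshold behavior.

```python
import math

# All constants below are illustrative assumptions, not values from the talk.
VTH = 0.40            # assumed threshold voltage (V)
N_VT = 1.5 * 0.026    # assumed slope factor times thermal voltage (V)


def on_current(vdd):
    """Normalized drive current: exponential below VTH, roughly quadratic above."""
    x = (vdd - VTH) / (2 * N_VT)
    return math.log1p(math.exp(x)) ** 2


def energy_per_op(vdd, activity, leak_ratio=1e-3):
    """Normalized energy per operation = switching energy + leakage energy."""
    dynamic = activity * vdd ** 2          # activity * C * V^2, with C = 1
    delay = vdd / on_current(vdd)          # gate delay ~ C * V / I_on
    leakage = leak_ratio * vdd * delay     # leakage power integrated over one op
    return dynamic + leakage


def optimal_vdd(activity, leak_ratio=1e-3):
    """Grid-search the supply voltage that minimizes energy per operation."""
    grid = [0.15 + 0.001 * i for i in range(1051)]   # 0.15 V to 1.20 V
    return min(grid, key=lambda v: energy_per_op(v, activity, leak_ratio))


if __name__ == "__main__":
    print("core-like block  (activity 0.10):", round(optimal_vdd(0.10), 3), "V")
    print("cache-like block (activity 0.01):", round(optimal_vdd(0.01), 3), "V")
```

Under these assumed parameters the low-activity block settles at a higher minimum-energy voltage than the high-activity block. Only that ordering is meaningful; the absolute voltages are artifacts of the chosen constants.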

# **Proposed NTC Architecture**

- SRAM is run at a higher V<sub>DD</sub>
  - Caches operate faster than core
- Can introduce clustered architecture
  - Multiple cores share L1
  - Cores see private L1
  - L1 still provides single-cycle latency
- Advantages:
  - Less coherence/snoop traffic
  - Larger cache for processes that need it
- Drawbacks:
  - Core conflicts evicting L1 data
    - Not dominant in simulation
  - Longer interconnect
    - Can be addressed with 3D stacking
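The shared-L1 idea can be illustrated with a small slot schedule. The sketch below assumes a fixed round-robin arbiter and a cache clocked at an integer multiple of the core clock. The arbitration policy, the function name `cache_slot_schedule`, and its parameters are hypothetical and are not taken from the design.

```python
def cache_slot_schedule(num_cores=4, clock_ratio=4, core_cycles=2):
    """Return (cache_cycle, core_cycle, core_id) tuples for a shared L1.

    Assumes fixed round-robin arbitration: within each core cycle the cache
    runs `clock_ratio` cycles and gives one slot to each core in turn.
    A core_id of None marks an unused slot.
    """
    if num_cores > clock_ratio:
        raise ValueError("more cores than cache slots per core cycle")
    schedule = []
    for core_cycle in range(core_cycles):
        for slot in range(clock_ratio):
            cache_cycle = core_cycle * clock_ratio + slot
            core_id = slot if slot < num_cores else None
            schedule.append((cache_cycle, core_cycle, core_id))
    return schedule


if __name__ == "__main__":
    for cache_cycle, core_cycle, core_id in cache_slot_schedule():
        print(f"cache cycle {cache_cycle}: core cycle {core_cycle}, serves core {core_id}")
```

Because every core receives exactly one cache slot inside each of its own clock periods, a hit still returns within a single core cycle. From the core's side, the shared L1 therefore behaves like a private one.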





# **Proposed Boosting Approach**

- Measured results for 130nm LP design
- 10MHz becomes ~110MHz in 32nm simulation
- 140 FO4 delay core

#### **Baseline**

- Cache runs 4x core frequency
- Pipelined cache

#### **Better Single Thread Performance**

- Turn some cores off, speed up the rest
- Cache de-pipelined
- Faster response time, same throughput
- Core sees larger cache
  - Faster cores need larger caches
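The "same throughput" point can be checked with a little arithmetic. This is an idealized sketch: it assumes the cache clock stays fixed while cores are turned off, and that the remaining cores split the cache slots evenly. The 10 MHz baseline is borrowed from the measured 130nm figure on this slide, and `per_core_frequency_mhz` is a hypothetical helper.

```python
def per_core_frequency_mhz(cache_freq_mhz, active_cores):
    """Core clock when `active_cores` evenly share a fixed-frequency cache."""
    if active_cores < 1:
        raise ValueError("at least one core must be active")
    return cache_freq_mhz / active_cores


# Assumed baseline: 4 cores at 10 MHz, cache at 4x the core frequency.
CACHE_FREQ_MHZ = 4 * 10.0

if __name__ == "__main__":
    for active in (4, 2, 1):
        core_mhz = per_core_frequency_mhz(CACHE_FREQ_MHZ, active)
        aggregate = core_mhz * active
        print(f"{active} active core(s): {core_mhz} MHz each, {aggregate} MHz aggregate")
```

In this idealization the aggregate core cycle rate stays equal to the cache frequency, while each remaining core runs faster. The real design may also change voltages and cache timing between modes.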



# **Cache Timing**

#### NTC Mode (3/4 Cores)

- Low power
- Tag arrays read first
- 0-1 data arrays accessed

#### Boost Mode (1/2)

- Low latency
- Data and tags read in parallel
- 4 data arrays accessed
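The two access orders can be compared by counting data-array reads per lookup. The sketch below models a single cache set and assumes 4-way associativity, chosen to match "4 data arrays accessed". The `lookup` function and its mode strings are hypothetical names.

```python
WAYS = 4  # assumed associativity, chosen to match "4 data arrays accessed"


def lookup(set_tags, set_data, tag, mode):
    """Look up `tag` in one cache set; return (value_or_None, data_arrays_read)."""
    if len(set_tags) != WAYS or len(set_data) != WAYS:
        raise ValueError("expected one tag and one data entry per way")
    hit_way = next((w for w, t in enumerate(set_tags) if t == tag), None)
    if mode == "ntc":
        # Tags are compared first; only the hitting way's data array is read.
        if hit_way is None:
            return None, 0
        return set_data[hit_way], 1
    if mode == "boost":
        # Every data array is read alongside the tags; the match selects one.
        speculative_reads = list(set_data)
        value = None if hit_way is None else speculative_reads[hit_way]
        return value, len(speculative_reads)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    tags = [0x10, 0x20, 0x30, 0x40]
    data = ["a", "b", "c", "d"]
    print(lookup(tags, data, 0x30, "ntc"))
    print(lookup(tags, data, 0x30, "boost"))
```

Serial tag-then-data access saves the energy of three data-array reads on a hit and all four on a miss, at the cost of an extra step of latency. Parallel access pays that energy every time to return data sooner.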








- 7-layer NTC system
  - 2-layer system completed fabrication with measured results
- Full 7-layer system expected end of 2012





- Up to two pairs
- 16 clusters per pair
- Cores have only vertical interconnections

- Bus interconnect architecture
  - Up to 500 MHz
  - 9-11 cycle latency
  - 1-3 core cycles
- 8 lanes, each 128b
  - One per DRAM interface
  - Each cluster connects to all eight
  - 1024b total
- Vertically connected Tezzaron through all four layers
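The lane figures above imply an upper bound on bus bandwidth. The sketch below assumes one transfer per lane per bus cycle at the maximum 500 MHz and ignores protocol overhead. That transfer rate is an assumption, so treat the result as a ceiling rather than a measured number.

```python
LANES = 8                 # from the slide: 8 lanes
LANE_WIDTH_BITS = 128     # from the slide: each 128b
BUS_FREQ_HZ = 500e6       # from the slide: up to 500 MHz

total_width_bits = LANES * LANE_WIDTH_BITS


def peak_bandwidth_gbytes_per_s(width_bits, freq_hz):
    """Peak bandwidth in GB/s, assuming one transfer per cycle on every lane."""
    return width_bits * freq_hz / 8 / 1e9


peak = peak_bandwidth_gbytes_per_s(total_width_bits, BUS_FREQ_HZ)

if __name__ == "__main__":
    print(f"total bus width: {total_width_bits} bits")
    print(f"peak bandwidth under the stated assumption: {peak} GB/s")
```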



Flipping interface enables 128-core system

- **3D-Stacked DRAM**
  - Tezzaron Octopus
  - 1 control layer
    - 130nm CMOS
  - Up to two 1 Gb bitcell layers
    - DRAM process
  - 8x 128b DDR2 interfaces
  - Operated at bus frequency (up to 500 MHz)

Figure: system cross-section showing the core layers (Cortex M3 cores), the cache layers (I\$/D\$ banks, cache bus hubs with 8x8 crossbars, flipping interface), the DRAM bus hubs (8x4 crossbars), and the Tezzaron Octopus DRAM control and bitcell layers.




- 130nm process
- 12.66x5mm per layer
- 28.4M device core layer
- 18.0M device cache layer

### **2-Layer Stacking Process Evaluated**



For the measured 2-layer system, aluminum wirebond pads were used instead.

#### **Cache 3D Connections**



#### **Core 3D Connections**



#### **Cluster 3D Connections**



#### **Silicon Results**



### **Die Shot**




# **System Configurations**



# **Measured Results**




Measured Results: Centip3De – 3,930 DMIPS/W (130nm)

Industry Comparison: ARM A9 – 8,000 DMIPS/W (40nm) [1]

Estimated Results: Centip3De – 18,500 DMIPS/W (45nm)

[1] http://arm.com/products/processors/cortex-a/cortex-a9.php, ARM Ltd, 2011.
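The three figures above are easier to compare as ratios. The sketch below only divides the numbers quoted on the slide. It assumes they are efficiency scores in DMIPS/W, and the dictionary keys are labels invented for this example.

```python
# Efficiency figures as quoted on the slide (assumed to be in DMIPS/W).
EFFICIENCY = {
    "centip3de_measured_130nm": 3930,
    "arm_a9_40nm": 8000,
    "centip3de_estimated_45nm": 18500,
}


def ratio(numerator, denominator):
    """Return how many times more efficient `numerator` is than `denominator`."""
    return EFFICIENCY[numerator] / EFFICIENCY[denominator]


if __name__ == "__main__":
    print("estimated 45nm vs measured 130nm:",
          round(ratio("centip3de_estimated_45nm", "centip3de_measured_130nm"), 2))
    print("estimated 45nm vs ARM A9 40nm:",
          round(ratio("centip3de_estimated_45nm", "arm_a9_40nm"), 2))
    print("measured 130nm vs ARM A9 40nm:",
          round(ratio("centip3de_measured_130nm", "arm_a9_40nm"), 2))
```

The process nodes differ across the three entries, so the ratios mix architectural and technology effects.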

# Conclusion

- Near threshold computing (NTC)
  - Need low power solutions to maintain TDP
  - Achieves 10x energy efficiency => 10x more computation for a given TDP
  - Offers optimum balance between performance and energy
  - Allows boosting for single-threaded performance (Amdahl's law)
- Large scale 3D CMP demonstrated
  - 64 cores currently
  - 128 cores + DRAM in the future
  - 3D design shown to be feasible
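The Amdahl's law point can be made concrete with a short calculation. This sketch uses the textbook formula, extended with a hypothetical `boost` factor that speeds up only the serial portion. The 90% parallel fraction and the 4x boost are illustrative assumptions, not measured values.

```python
def speedup(parallel_fraction, cores, boost=1.0):
    """Amdahl's law speedup over one unboosted core.

    The serial portion runs on a single core that is `boost` times faster;
    the parallel portion is spread evenly across `cores` unboosted cores.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    serial_time = (1.0 - parallel_fraction) / boost
    parallel_time = parallel_fraction / cores
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    print("64 cores, no boost:      ", round(speedup(0.9, 64), 2))
    print("64 cores, 4x serial boost:", round(speedup(0.9, 64, boost=4.0), 2))
```

With many slow cores the serial portion dominates run time, which is why a mode that trades core count for single-thread speed matters in a 64-core near-threshold system.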

This work was funded and organized with the help of DARPA, Tezzaron, ARM, and the National Science Foundation