# Stratix<sup>®</sup> 10: 14nm FPGA Delivering 1GHz

Mike Hutton, Product Architect, Altera IC Design

HotChips 2015



### **Acknowledgements**

### The Stratix 10 Architecture and Definition Teams

- Herman Schmit, Dana How, Gordon Chiu, Carl Ebeling, Bruce Pedersen, Andy Lee, Martin Langhammer, Ben Gamsa
- Dave Lewis, Valavan Manohararajah, David Galloway, Jeff Chromczak, Tim Vanderhoek, Ian Milton
- Sean Atsatt, David Shippy, Arif Rahman, Mark Chan, Jeff Schultz, Richard Grenier, Steven Perry, Jiefan Zhang, Rita Chu, Ting Lu
- KS Foo, Chee Hak Teh, Lai Guan Tang
- Bernhard Friebe, Eleena Ong, Jordon Inkeles, Allan Davidson, Lux Joshi, Martin Won



### **Stratix 10 Architectural Big Rocks**

- 2X the achievable performance of Stratix V
- At up to 70% lower power
- Heterogeneous 3D System In Package (SiP) integration
- Adoption of Intel 14nm tri-Gate process
- Hierarchical configuration and security

### Stratix 10 Innovations: 2.5D





# Stratix 10 Innovations: Configuration & Clocking






# Stratix 10 Innovations: HyperFlex



#### Multi-Die via EMIB

- Separate core / transceiver
- Embedded Multi-Die Interconnect Bridge

#### Scalable Sector Architecture

- Software Configuration
- Configuration NoC
- Routable Clocks

#### Core Performance

- HyperFlex Fabric, Tri-Gate
- 1GHz M20K and DSP MAC
- 750 MHz Floating Point



# Stratix 10 Innovations: SoC & Memory




#### SoC

- 1.5GHz Performance ARM<sup>®</sup> Cortex A53
- DDR I/O Banks with Hard Memory Controller



### **System-in-Package Construction**





# System-in-Package EMIB Technology



Many benefits:

- Reduced complexity vs. full interposer, and no reticle limit
- De-Couple analog (transceiver) development from digital FPGA fabric
- Transceiver reliability & yield enhancement
  - Don't need rectangular "die"
  - Matching transceiver speed-grades
- Tick mixed with tock for added derivatives
  - E.g. 56G PAM4 transceiver tile
  - E.g. new hardened I/O interface IP
  - E.g. SiP Memory or ASIC tiles





### **Sectors and Configuration Sub-System**

- Config-System manages CRAM
- Historically just a shift register
  - AR/DR to Configuration RAM Array
  - FSM controlled
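
The historical scheme above can be sketched as a toy model. This assumes AR and DR stand for an address register and a data register: a frame is shifted serially into the DR, then written to the CRAM row selected by the AR. The names `shift_in` and `configure`, the frame contents, and the 3-bit width are hypothetical and only illustrate the idea.

```python
def shift_in(frame_bits, width):
    """Serially shift bits into a data register (DR) of the given width."""
    dr = [0] * width
    for bit in frame_bits:
        # Each clock: the new bit enters at position 0, the oldest bit falls off the end.
        dr = [bit] + dr[:-1]
    return dr


def configure(frames, width):
    """Toy FSM: for each address, fill the DR, then write it to that CRAM row."""
    cram = []
    for address, frame in enumerate(frames):  # `address` plays the role of the AR
        dr = shift_in(frame, width)
        cram.append(dr)  # becomes row `address` of the CRAM array
    return cram


# Hypothetical two-frame bitstream for a 2 x 3 CRAM array.
cram = configure([[1, 0, 0], [0, 1, 1]], width=3)
print(cram)
```

Because bits enter at one end of the DR, the first bit shifted in lands in the last column of the row. Everything here is sequential and fixed-function, which is the limitation the software-based approach is meant to remove.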



Modern configuration adds significant system functionality

- Encryption, decryption, bitstream compression, redundancy
- Security: authentication, side-channel, firewall, PUF
- SEU, scrubbing and partial-configuration management
- Debug
- Our solution: move it to software
  - More robust, upgradable, and risk-averse



# **Stratix 10 Configuration Sub-system Overview**

# Secure Device Manager (SDM)

- Config and Re-Config, compression
- Security: authentication, encryption, PUF
- Maintenance (power, T/V, SEU, debug)

# Local Sector Manager (LSM)

Sector configuration manager

# Config Network-on-Chip (CNoC)

SDM/LSM Communication







### **SoC Application Processor**

# 1.5 GHz Quad-Core ARM<sup>®</sup> Cortex<sup>™</sup> A53

- CCU: Cache-coherency between FPGA accelerators and processors
- Integrated with configuration subsystem (SDM) sharing peripherals





## **Routable Clocking**

- SW-routed clocks in sector "seams"
- More efficient use of globals
- Active skew management








# **Core Fabric Building Blocks**



[Figure: Adaptive Logic Module, with LUT and packed-register outputs]



**1GHz RAM Data Forwarding** 




#### **1GHz DSP MAC** 10 TFLOPs IEEE 754






LIM/LEIM/DIM fabric



### **Fabric Performance and Power**

## Performance:

- A complete "re-think" on the philosophy of FPGA Fabric Architecture
- Registers are not just logic resources, they are routing resources
- Goal is to enable seamless movement and addition (pipelining) of registers
- Target: 2X the performance, without making the wires "2x faster"

## Power:

- 14nm Tri-Gate process (FinFET) provides process benefit for power
- Expanded use of VID and power management adds more
  - High-Performance 800 mV to 940 mV
  - Low-Power options from 850 mV down to 800 mV
- HyperFlex for power reduction
  - Combine performance from HyperFlex with low-power options
- **Target**: 50% to 70% lower power per function, without slowing down
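
To give a feel for what the voltage options are worth, here is a first-order sketch. It assumes the standard CMOS approximation that dynamic power scales with V² · f, and it ignores leakage and any change in achievable frequency. The function name `dynamic_power_ratio` is hypothetical; the 800 mV and 940 mV endpoints are the ones listed above.

```python
def dynamic_power_ratio(v_new_mv, v_ref_mv, f_ratio=1.0):
    """First-order dynamic power ratio, assuming P_dyn is proportional to V^2 * f."""
    return (v_new_mv / v_ref_mv) ** 2 * f_ratio


# Same design and clock rate, run at 800 mV instead of 940 mV.
ratio = dynamic_power_ratio(800, 940)
print(f"dynamic power at 800 mV relative to 940 mV: {ratio:.3f}")
```

Under these assumptions the lower voltage alone gives roughly 28% less dynamic power at the same clock rate, so voltage selection is one contributor to the target rather than the whole of it.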



# **Re-timing and Pipelining in Conventional FPGAs**



#### Raw Logic

- Unbalanced paths



#### **Re-Timing**

- Balance flops
- 16% f<sub>max</sub> gain
- Added area

#### Pipelining

- Add flops
- 40% f<sub>max</sub> gain
- Added clock tick
- Added area
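
The difference between the three cases can be sketched numerically. This is a minimal model in which the clock rate is set by the slowest register-to-register stage. The stage delays are hypothetical and the balancing is idealized, so the resulting gains are illustrative and do not reproduce the 16% and 40% figures above.

```python
def fmax_mhz(stage_delays_ns):
    """Clock rate is limited by the slowest register-to-register stage."""
    return 1000.0 / max(stage_delays_ns)


def split_evenly(total_ns, stages):
    """Idealized balancing: spread the same total delay over `stages` stages."""
    return [total_ns / stages] * stages


# Hypothetical two-stage path with unbalanced delays, in ns.
raw = [3.0, 2.0]
retimed = split_evenly(sum(raw), 2)    # same number of stages, flops moved to balance
pipelined = split_evenly(sum(raw), 3)  # one extra stage, so one extra clock tick

for name, stages in [("raw", raw), ("retimed", retimed), ("pipelined", pipelined)]:
    print(f"{name:10s} stages={len(stages)} fmax={fmax_mhz(stages):.0f} MHz")
```

Re-timing keeps latency the same and only moves registers. Pipelining changes latency, which is why it needs the designer's consent.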



# HyperFlex: Pipeline Registers by Design



Routing muxes (all H/V wires) have optional registers

Including LAB, M20K and DSP block inputs, CC, SCLR/CE

# Architectural Goals:

- Perfect balance: P&R chooses the right register (of many) to turn on
- Simple software: Re-timing is a simple push/pull along the path
- No wasted LEs: Designs with high FF:LUT ratios are no longer an issue
- No wasted routing: Don't have to route to find an available FF

# Moving a Register in the HyperFlex fabric

# Disable in ALM, add to routing



Moving a register is a push/pull operation on the route:

- There is always a register on the routing mux
- Quartus<sup>®</sup> II chooses the most appropriate FF for path balancing
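
The push/pull idea can be sketched as a toy model. It assumes a route is a chain of routing muxes, each with an optional register that is either on or off, so that moving a register means turning one off and another on without re-routing. The `Hop` class, the helper names, and the delays are hypothetical, and the brute-force search stands in for whatever Quartus II actually does.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    """One routing mux on a path; delay_ns is the delay of the segment feeding it."""
    delay_ns: float
    reg_on: bool = False  # state of the optional register at this mux


def stage_delays(route):
    """Sum segment delays between enabled registers along the route."""
    stages, acc = [], 0.0
    for hop in route:
        acc += hop.delay_ns
        if hop.reg_on:
            stages.append(acc)
            acc = 0.0
    if acc:
        stages.append(acc)  # tail segment up to the destination register
    return stages


def move_register(route, src, dst):
    """Push/pull: disable the register at hop `src`, enable the one at hop `dst`."""
    assert route[src].reg_on and not route[dst].reg_on
    route[src].reg_on = False
    route[dst].reg_on = True


def best_single_register(route):
    """Brute force: which hop's register minimizes the worst stage delay?"""
    def worst(i):
        trial = [Hop(h.delay_ns, j == i) for j, h in enumerate(route)]
        return max(stage_delays(trial))
    return min(range(len(route)), key=worst)


# Hypothetical route with the register initially enabled at the first mux.
route = [Hop(0.4, True), Hop(0.3), Hop(0.5), Hop(0.2), Hop(0.6)]
before = max(stage_delays(route))
best = best_single_register(route)
move_register(route, 0, best)
after = max(stage_delays(route))
print(f"worst stage: {before:.1f} ns -> {after:.1f} ns (register at hop {best})")
```

The wires and their delays never change in this model. Only the choice of which register is enabled changes, which is the point of having one at every routing mux.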



# **Re-timing and Pipelining in Stratix 10**



### **Software and Designer Use-Model**

Software adds a new step



Concentrate on critical domains/chains, not volatile reg-reg paths

HyperPipeline the data-path, optimize control logic





Traditional: Similar Criticality

More Critical



### **Power: Half the power per function**

- 14nm Tri-Gate provides a good chunk of this
  - Allows us to take more of the process benefit as performance
- Expanded use of VID and power management
  - M20K and DSP block power gating
- Added registers help:
  - Reduced footprint for register-heavy designs
- At 2X the speed, reduced size
  - Half the width means half the area
  - Which means half the static power on the same device.
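
The last three bullets are simple arithmetic, sketched below. This assumes a fixed throughput requirement, area proportional to datapath width, and static power proportional to the area the function occupies. The throughput and clock numbers are hypothetical.

```python
def datapath_width_bits(throughput_gbps, clock_mhz):
    """Bits per cycle needed to sustain a given throughput at a given clock rate."""
    return throughput_gbps * 1000.0 / clock_mhz


# Hypothetical function that must sustain 128 Gb/s.
throughput_gbps = 128.0
base_width = datapath_width_bits(throughput_gbps, 250.0)  # baseline clock
fast_width = datapath_width_bits(throughput_gbps, 500.0)  # 2X the clock

area_ratio = fast_width / base_width  # first order: area scales with width
static_ratio = area_ratio             # first order: static power scales with area used

print(f"width: {base_width:.0f} -> {fast_width:.0f} bits")
print(f"area ratio: {area_ratio:.2f}, static power ratio: {static_ratio:.2f}")
```

Real designs will not scale this cleanly, since control logic and fixed-width interfaces do not shrink with the datapath, so the halving should be read as a first-order bound.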








### **Area/Delay/Power Tradeoffs with Stratix 10**



[Figure: area/delay/power tradeoffs]



### Summary

- 3D integration isn't just integration, it is
  - De-risking, process matching, derivative proliferation and tick/tock
- Control Con
  - Software control allows for security and feature-up of devices
- SoC integration is mainstream
  - Processor cost is a small subset of the die, coherent-accelerators
- Pipelining unlocks optimizations in FPGA architecture
  - Using wires efficiently, not brute-forcing them faster
  - Faster == lower power when you can get designs to a more efficient place
- Process is still giving us power benefits
  - 14nm Tri-Gate reduces power, enabling higher performance circuit-design



# **Thank You**

© 2015 Altera Corporation—Public

All rights reserved. ALTERA, ARRIA, CYCLONE, ENPIRION, MAX, MEGACORE, NIOS, QUARTUS and STRATIX words and logos are trademarks of Altera Corporation and registered in the U.S. Patent and Trademark Office and in other countries. All other words and logos identified as trademarks or service marks are the property of their respective holders as described at www.altera.com/legal.

