

# 4<sup>th</sup> Generation Intel® Core<sup>™</sup> Processor, codenamed Haswell

Per Hammarlund

Haswell Chief Architect, Intel Fellow

August, 2013


### **Legal Notices and Disclaimers**

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

Intel may make changes to specifications and product descriptions at any time, without notice.

All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchase, including the performance of that product when combined with other products.

Intel, Core i7, Core i5, Core i3, Ultrabook, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

\* Other names and brands may be claimed as the property of others.

Copyright © 2013 Intel Corporation.





- Power Efficiency and Management
- FIVR: Fully Integrated Voltage Regulator
- Cache Hierarchy and Interconnects
- Gfx/Media
- Intel<sup>®</sup> Microarchitecture (Haswell): Core
- ISA
- Wrap Up



- **Huge family**: SOC methodology, common architecture
- **Low power platform**: 20x idle power reduction, low power IO (I2C, SDIO, I2S, UART), Link power management (USB, PCIe, SATA)
- Large eDRAM Cache
- **Platform**: PSR (Panel Self Refresh)
- FIVR: Fully Integrated Voltage Regulator
- **Core**: FMA (Floating-point Multiply Add), 2x Cache BW, TSX (Transactional Synchronization Extensions)
- **Graphics**: 2x in Ultrabooks, OpenCL 1.2, DX 11.1, OpenGL 4.0
- Media: 5x faster at 0.5x power

#### **Modularity Options**

| Option             | Range                        |
|--------------------|------------------------------|
| Core Count         | 2-4                          |
| Graphics           | GT1, GT2, GT3                |
| Active Power Level | Tablet to Desktop            |
| Idle Power         | Variable                     |
| Cache Size         | Variable                     |
| Interconnects      | Variable                     |
| Platforms          | Traditional, power optimized |



Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

### Intel 22nm Process Technology and Tick/Tock Development Model

|                    | Nehalem | Westmere | Sandy Bridge | Ivy Bridge | Haswell |
|--------------------|---------|----------|--------------|------------|---------|
| Process Technology | 45nm | 32nm | 32nm | 22nm | 22nm |
| Microarchitecture  | NEW Intel<sup>®</sup> Microarchitecture (Nehalem) | Intel Microarchitecture (Nehalem) | NEW Intel Microarchitecture (Sandy Bridge) | Intel Microarchitecture (Sandy Bridge) | NEW Intel Microarchitecture (Haswell) |
| Model              | TOCK | TICK | TOCK | TICK | TOCK |

- Enhanced version of Intel's 22nm Process Technology
- 22nm Tri-Gate transistors enhanced to reduce leakage current 2-3X with the same frequency capability
- Haswell version of 22nm has 11 metal interconnect layers compared to 9 layers on Ivy Bridge to optimize performance, area and cost

Haswell builds on innovations in 2<sup>nd</sup> and 3<sup>rd</sup> Generation Intel<sup>®</sup> Core<sup>™</sup> i3/i5/i7 Processors (Sandy Bridge/Ivy Bridge) with optimized Intel process technology!






# Power Efficiency: Maximizing Power-Limited Performance

- Extended operating range
  - Increased Turbo
  - New C-states, improved latency
  - Power efficient features: better than voltage / frequency scaling
  - Continued focus on gating unused logic and low-power modes
  - Optimized manufacturing and circuits
- Independent frequency domains
  - Cores separated from LLC+Ring for fine-grained control
  - Power Control Unit dynamically allocates budget when power-limited
  - Prioritization based on run-time characteristics selects domain with the highest performance return
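
The budget-allocation idea in the last two bullets can be illustrated with a small greedy sketch. This is not the Power Control Unit's actual algorithm, which the slides do not describe; the function `allocate_budget`, the domain names, and the performance curves are all hypothetical and exist only to show what "select the domain with the highest performance return" could mean.

```python
from typing import Callable, Dict


def allocate_budget(
    budget_watts: float,
    perf_curves: Dict[str, Callable[[float], float]],
    step_watts: float = 1.0,
) -> Dict[str, float]:
    """Greedily give each power slice to the domain with the best marginal return.

    perf_curves maps a domain name to a hypothetical function that returns
    performance (arbitrary units) for a given power allocation in watts.
    """
    allocation = {name: 0.0 for name in perf_curves}
    remaining = budget_watts
    while remaining >= step_watts:
        def marginal_gain(name: str) -> float:
            curve = perf_curves[name]
            return curve(allocation[name] + step_watts) - curve(allocation[name])

        # Pick the domain whose performance improves most from one more slice.
        best = max(perf_curves, key=marginal_gain)
        allocation[best] += step_watts
        remaining -= step_watts
    return allocation


# Hypothetical diminishing-return curves; real curves depend on the workload.
example_curves = {
    "cores": lambda watts: 10.0 * watts ** 0.5,
    "llc_ring": lambda watts: 4.0 * watts ** 0.5,
}
example_allocation = allocate_budget(10.0, example_curves)
```

With these made-up curves most of the budget goes to the cores, but the ring still receives a slice once its marginal return exceeds that of the cores.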






### **Haswell Power Management Innovation**

- All day experiences
  - Improving power efficiency for active workloads
- Evolutionary improvements
- New extremely low-power active state
  - 20x improvement from prior generation
  - Enables significant improvement in realizable battery life
  - Automatic, continuous, fine-grained, transparent to well-written software
  - Leverages learnings from phone & tablet development



#### **Resume Time**

#### Everything that is not needed is turned off!













# **FIVR: Platform Goodness**

#### **Ivy Bridge**

- Back is all power
- Large inductors, butterfly mounted through board
- <u>5.4mm thick</u>

#### Haswell

- Backside bare
- Small inductors and capacitors, and 75% fewer of them
- Space for 10% larger battery
- <u>3.4mm thick</u>



*2mm thinner; ~\$5 cheaper; space for 10% larger battery*








## **Cache, Interconnect and System Agent**

- More access bandwidth per slice of shared LLC
  - New dedicated pipelines handle data and non-data accesses independently
- Improved load balancing to System Agent
  - Better credit-based management more efficiently shares resources
- Improved DRAM write throughput
  - Deeper pending queues: more decoupling, better scheduling
- Lower power, better efficiency
  - Focused effort to reduce idle and active power (next section)





### Large eDRAM Cache

- Haswell introduces configurations with large graphics & large cache
- Cache attributes
  - High throughput and low latency
  - Flat latency vs. sustained bandwidth curve
  - Fully shared between Graphics, Media, and Cores for very efficient multi-media computing





### Large Caches in Graphics Workloads

- Intra-frame
  - Intra-render-pass: captures spatial and temporal locality within a surface
    - Captured in moderate cache sizes (1-8MB LLC). Sandy Bridge (SNB) silicon shows 20-30% speedup
  - Inter-render-pass: captures a full surface from generation to subsequent consumption (shadow maps, render targets)
    - Captured in big cache sizes (16-64+MB LLC). Crystal Well (CRW) silicon shows 20-30% speedup
- Inter-frame
  - Captures texture reuse across frames due to continuity between frames





# **Large Cache Performance and Latency**



Pre-production system measurements; product measurements may vary.









#### Haswell Processor Graphics Architecture Building Blocks



#### Sets the stage for Scale-up!!

#### Scalable Architecture partitioned into 6 domains:

1. Global Assets: Geometry Front-end up to Setup
2. Slice Common: Rasterizer, Level 3 Cache (L3\$) and Pixel Back-end
3. Sub-Slice: Shaders (EUs), Instruction Caches (IC\$) and Samplers
   - Scalable slices for performance and GFlop tuning
4. Multi-Format Video CODEC Engine (MFX)
5. Video Quality Enhancement Engine
6. Displays

### Video Codec

- Introducing hardware-based SVC (Scalable Video Coding) codec
  - Allowing single encoded bit-stream for heterogeneous devices
  - Key enabler for multi-participant video conferencing
- MJPEG (Motion JPEG) hardware decoder
  - Enabling low power HD video conferencing for USB2 webcam
- MPEG2 hardware encoder
  - DVD creation
  - DLNA streaming
- 4Kx2K video playback
- Continue to drive encoder quality
  - Introduced through the encoding modes in Media SDK



Haswell adds new codecs on top of the existing codecs in 3<sup>rd</sup> Generation Intel<sup>®</sup> Core<sup>™</sup> processors



# **High Quality Video Processing**

Dedicated video processing on newly designed Video Quality Engine (VQE)

Haswell supports an extensive suite of video processing functions including:

- De-Noise (DN)
- De-Interlace (DI)
- Film-mode Detection (FMD)
- Skin Tone Detection (STD)
- Skin Tone Enhancement (STE)
- Total Color Control (TCC)
- Adaptive Contrast Enhancement (ACE)
- Advanced Video Scaler (AVS)
- Gamut Compression (GC)
- Gamut Expansion (GE)<sup>1</sup>
- Skin Tone Tuned Image Enhancement Filter<sup>1</sup>
- Frame Rate Conversion (FRC)<sup>1</sup>
- Image Stabilization (IS)<sup>1</sup>

<sup>1</sup>New on Haswell



*Higher quality video at lower power!* 



#### Media: Quick Sync Video Performance and Power

- 4-12x real-time transcode at various quality modes
- 10-hour video playback time on latest Apple MacBook Air
- Multi-stream 4K decode
- Faster than real-time 4K encode



\*Measurements based on Intel Demo Clip in Cyberlink Media Espresso Fast Conversion Mode









### **Haswell Core at a Glance**



#### Next generation branch prediction

- Improves performance and saves wasted work

#### **Improved front-end**

- Initiate TLB and cache misses speculatively
- Handle cache misses in parallel to hide latency
- Leverages improved branch prediction

#### **Deeper buffers**

- Extract more instruction parallelism
- More resources when running a single thread

#### More execution units, shorter latencies

- Power down when not in use

#### More load/store bandwidth

- Better prefetching, better cache line split latency and throughput, double L2 bandwidth
- New modes save power without losing performance

#### No pipeline growth

- Same branch mis-prediction latency
- Same L1/L2 cache latency



#### **Haswell Buffer Sizes**

#### Extract more parallelism in every generation

|                       | Nehalem   | Sandy Bridge | Haswell |
|-----------------------|-----------|--------------|---------|
| Out-of-order Window   | 128       | 168          | 192     |
| In-flight Loads       | 48        | 64           | 72      |
| In-flight Stores      | 32        | 36           | 42      |
| Scheduler Entries     | 36        | 54           | 60      |
| Integer Register File | N/A       | 160          | 168     |
| FP Register File      | N/A       | 144          | 168     |
| Allocation Queue      | 28/thread | 28/thread    | 56      |



Intel® Microarchitecture (Haswell); Intel® Microarchitecture (Nehalem); Intel® Microarchitecture (Sandy Bridge)

### **Haswell Execution Unit Overview**



### FMA (Floating-point Multiply Add)



| Latency (clks) | Prior Gen | New Haswell | Ratio |
|----------------|-----------|-------------|-------|
| MulPS, PD      | 5         | 5           |       |
| AddPS, PD      | 3         | 3           |       |
| Mul+Add / FMA  | 8         | 5           | 1.6   |

- 2 new FMA units provide 2x peak FLOPs/cycle of previous generation
- 2X cache bandwidth to feed wide vector units
  - 32-byte load/store for L1
  - 2x L2 bandwidth
- 5-cycle FMA latency same as an FP multiply

#### FMA provides improved accuracy and performance
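
The accuracy and latency claims can be illustrated with a short sketch. It models a fused multiply-add in software: the product and sum are computed exactly with `fractions.Fraction` and rounded once, whereas the unfused form rounds after the multiply and again after the add. The helper names and the input values are chosen for illustration only and are not from the slides; the latency figures are the ones in the table above.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the result is rounded twice."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Software model of FMA: exact a*b + c, rounded once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Illustrative inputs: a*b = 1 - 2**-60, which rounds to 1.0 as a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused_result = unfused_multiply_add(a, b, c)  # low-order bits are lost
fused_result = fused_multiply_add(a, b, c)      # keeps the exact -2**-60

# Latency figures from the table: dependent multiply then add vs. one FMA.
separate_latency_clks = 5 + 3
fma_latency_clks = 5
latency_ratio = separate_latency_clks / fma_latency_clks
```

In this example the unfused form returns zero while the single-rounding form preserves the small residual, which is the kind of difference that matters in dot products and polynomial evaluation.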

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel<sup>®</sup> Microarchitecture (Haswell); Intel<sup>®</sup> Microarchitecture (Sandy Bridge); Intel<sup>®</sup> Microarchitecture (Merom); Intel<sup>®</sup> Microarchitecture (Banias)



### **Core Cache Size/Latency/Bandwidth**

| Metric                      | Nehalem                                         | Sandy Bridge                                   | Haswell                                        |
|-----------------------------|-------------------------------------------------|------------------------------------------------|------------------------------------------------|
| L1 Instruction Cache        | 32K, 4-way                                      | 32K, 8-way                                     | 32K, 8-way                                     |
| L1 Data Cache               | 32K, 8-way                                      | 32K, 8-way                                     | 32K, 8-way                                     |
| L1 Data fastest load-to-use | 4 cycles                                        | 4 cycles                                       | 4 cycles                                       |
| L1 Data load bandwidth      | 16 Bytes/cycle                                  | 32 Bytes/cycle (banked)                        | 64 Bytes/cycle                                 |
| L1 Data store bandwidth     | 16 Bytes/cycle                                  | 16 Bytes/cycle                                 | 32 Bytes/cycle                                 |
| L2 Unified Cache            | 256K, 8-way                                     | 256K, 8-way                                    | 256K, 8-way                                    |
| L2 fastest load-to-use      | 10 cycles                                       | 11 cycles                                      | 11 cycles                                      |
| L2 bandwidth to L1          | 32 Bytes/cycle                                  | 32 Bytes/cycle                                 | 64 Bytes/cycle                                 |
| L1 Instruction TLB          | 4K: 128, 4-way; 2M/4M: 7/thread                 | 4K: 128, 4-way; 2M/4M: 8/thread                | 4K: 128, 4-way; 2M/4M: 8/thread                |
| L1 Data TLB                 | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: fractured  | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: 4, 4-way  | 4K: 64, 4-way; 2M/4M: 32, 4-way; 1G: 4, 4-way  |
| L2 Unified TLB              | 4K: 512, 4-way                                  | 4K: 512, 4-way                                 | 4K+2M shared: 1024, 8-way                      |

All caches use 64-byte lines.
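
As a worked example of what the table implies, the sketch below computes the address range an L2 TLB can map ("TLB reach") and a peak L1 load bandwidth. The entry counts and bytes-per-cycle figures come from the table; the 3.0 GHz clock is an assumed value for illustration, not a product specification, and the function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def tlb_reach_bytes(entries: int, page_bytes: int) -> int:
    """Upper bound on memory mapped when every entry holds a page of this size."""
    return entries * page_bytes


def peak_bandwidth_gb_per_s(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Peak bytes/cycle times cycles/ns gives GB/s (1 GB = 10**9 bytes)."""
    return bytes_per_cycle * clock_ghz


# L2 unified TLB: 512 4K-entries on Sandy Bridge, 1024 shared 4K+2M entries on Haswell.
sandy_bridge_reach_4k = tlb_reach_bytes(512, 4 * KIB)
haswell_reach_4k = tlb_reach_bytes(1024, 4 * KIB)
haswell_reach_2m = tlb_reach_bytes(1024, 2 * MIB)

# L1 data load bandwidth per core at an assumed 3.0 GHz clock.
ASSUMED_CLOCK_GHZ = 3.0
sandy_bridge_l1_load_gbs = peak_bandwidth_gb_per_s(32, ASSUMED_CLOCK_GHZ)
haswell_l1_load_gbs = peak_bandwidth_gb_per_s(64, ASSUMED_CLOCK_GHZ)
```

Because the Haswell L2 TLB entries are shared between 4K and 2M pages, the 2M figure is a best case that applies only when all entries hold large pages.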







#### **Haswell New Compute Instructions**

- Intel<sup>®</sup> Advanced Vector Extensions 2 (Intel<sup>®</sup> AVX2)
  - Includes
    - 256-bit Integer vectors
    - FMA: Fused Multiply-Add
    - Full-width element permutes
    - Gather
  - Benefits
    - High performance computing
    - Audio & Video
    - Games
- New Integer Instructions
  - Indexing and hashing
  - Cryptography
  - Endian conversion MOVBE
|              | Instruction Set | SP FLOPs per cycle | DP FLOPs per cycle |
|--------------|-----------------|--------------------|--------------------|
| Nehalem      | SSE (128-bits)  | 8                  | 4                  |
| Sandy Bridge | AVX (256-bits)  | 16                 | 8                  |
| Haswell      | AVX2 & FMA      | 32                 | 16                 |
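
The per-cycle figures in this table follow from vector width, the number of floating-point pipes, and the operations each pipe completes per lane. The sketch below reproduces them. The pipe counts for Nehalem and Sandy Bridge (one multiply pipe plus one add pipe) are background knowledge rather than something stated on the slide, and the function name is hypothetical.

```python
def peak_flops_per_cycle(
    vector_bits: int, element_bits: int, pipes: int, flops_per_lane: int
) -> int:
    """Peak FLOPs/cycle = lanes per vector * FP pipes * FLOPs per lane per op."""
    lanes = vector_bits // element_bits
    return lanes * pipes * flops_per_lane


# (vector width in bits, FP pipes, FLOPs per lane per operation)
# Assumption: Nehalem and Sandy Bridge issue one multiply and one add per cycle;
# Haswell issues two FMAs, each counting as a multiply plus an add.
GENERATIONS = {
    "Nehalem": (128, 2, 1),
    "Sandy Bridge": (256, 2, 1),
    "Haswell": (256, 2, 2),
}

peak_flops = {
    name: {
        "SP": peak_flops_per_cycle(bits, 32, pipes, flops),
        "DP": peak_flops_per_cycle(bits, 64, pipes, flops),
    }
    for name, (bits, pipes, flops) in GENERATIONS.items()
}
```

These are peak numbers per core; sustained throughput depends on keeping both pipes fed, which is why the load/store bandwidth was doubled as well.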

| Group                                    | Instructions                           |
|------------------------------------------|----------------------------------------|
| Bit Field Pack/Extract                   | BZHI, SHLX, SHRX, SARX, BEXTR          |
| Variable Bit Length Stream Decode        | LZCNT, TZCNT, BLSR, BLSMSK, BLSI, ANDN |
| Bit Gather/Scatter                       | PDEP, PEXT                             |
| Arbitrary Precision Arithmetic & Hashing | MULX, RORX                             |

- Full instruction specification available at: http://software.intel.com/en-us/avx/
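
For readers unfamiliar with the new integer instructions, the following sketch models a few of them in plain Python. It is a software reference model for illustration, not how the hardware implements them, and it covers only the common case of non-negative operands that fit in 64 bits. The function names simply mirror the mnemonics in the table.

```python
MASK64 = (1 << 64) - 1


def pext(source: int, mask: int) -> int:
    """Bit gather: pack the source bits selected by mask into the low bits."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask          # lowest set bit of the mask
        if source & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1               # clear that mask bit
    return result


def pdep(source: int, mask: int) -> int:
    """Bit scatter: deposit the low bits of source at the positions set in mask."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask
        if (source >> in_bit) & 1:
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result


def blsr(x: int) -> int:
    """Reset the lowest set bit."""
    return x & (x - 1)


def blsi(x: int) -> int:
    """Isolate the lowest set bit."""
    return x & -x


def andn(a: int, b: int) -> int:
    """Bitwise (NOT a) AND b, truncated to 64 bits."""
    return ~a & b & MASK64


def bzhi(x: int, index: int) -> int:
    """Zero the bits of x from position index upward (modelled for index < 64)."""
    return x & ((1 << index) - 1)
```

For example, `pext(0b10110110, 0b11110000)` gathers the upper nibble into `0b1011`, and `pdep(0b1011, 0b11110000)` scatters it back to `0b10110000`.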





# Cryptography protects nearly all data and transactions you want to keep secure







HLE: Hardware Lock Elision – XACQUIRE/XRELEASE

Bringing Transactional Memory to the Mainstream



### **TSX Evaluation on HPC Workloads**



Substitute atomic operations, locks, and non-blocking synchronization with RTM

Average 1.41x speedup with 8 threads

#### Workloads benefit from RTM by

1. Exploiting concurrency within a critical section (nufft)
2. Reducing the synchronization cost (**ssca2, physicsSolver, nufft, histogram**)
3. Replacing complex non-blocking synchronization with regular memory ops (canneal)




### Virtualization on Haswell with Intel® VT

Substantially improved guest/host transition times

New Accessed and Dirty bits for Extended Page Tables (EPT) eliminate a major cause of vmexits

Overhauled TLB invalidations – lower latency, less serialization

New VMFUNC instruction enables hyper-calls without a vmexit

Intel<sup>®</sup> VT-d adds 4-level page walks to match Intel<sup>®</sup> VT-x



Intel VT-x Roundtrip over Generations








### Wrap Up!

- **Huge family**: SOC methodology, common architecture
- Low power platform: 20x idle power reduction, low power IO (I2C, SDIO, I2S, UART), Link power management (USB, PCIe, SATA)
- Large eDRAM Cache
- **Platform**: PSR (Panel Self Refresh)
- **FIVR**: Fully Integrated Voltage Regulator
- **Core**: FMA (Floating-point Multiply Add), 2x Cache BW, TSX (Transactional Synchronization Extensions)
- Graphics: 2x in Ultrabooks, OpenCL 1.2, DX 11.1, OpenGL 4.0
- Media: 5x faster at 0.5x power


