

# PCI Express 3.0 Overview



Jasmin Ajanovic, Sr. Principal Engineer, Intel Corp.

HotChips - Aug 23, 2009

# Agenda

- PCIe Architecture Overview
- PCIe 3.0 Electrical Optimizations
- PCIe 3.0 PHY Encoding and Challenges
- New PCIe Protocol Features
- Summary & Call to Action



# **PCI Express\* (PCIe) Interconnect**

#### **Physical Interface**

- Point-to-point full-duplex
- Differential low-voltage signaling
- Embedded clocking
- Scalable width & frequency
- Supports connectors and cables

#### Protocol

- Load Store architecture
- Fully packetized split-transaction
- Credit-based flow control
- Virtual Channel mechanism

#### Advanced Capabilities

- Enhanced Configuration and Power Management
- RAS: CRC Data Integrity, Hot Plug, Advanced error logging/ reporting
- QoS and Isochronous support

### IO Trends

- Increase in IO Bandwidth
- **Reduction in Latency**
- **Energy Efficient Performance**

#### **Emerging Applications**

- Virtualization
- **Optimized interaction between Host & IO**
  - Examples: Graphics, Math, Physics, Financial & HPC apps

#### New Generations of PCI Express Technology



### PCIe Technology Roadmap




PCIe 1.x

PCIe 2.0

PCIe 3.0

**Improving Capabilities Every 3-4 Years!** 

### PCIe 3.0 Electrical Interface



### **PCIe 3.0 Electrical Requirements**

- Compatibility with PCIe 1.x, 2.0
- 2x payload performance bandwidth over PCIe 2.0
- Similar cost structure (i.e. no significant cost adders)
- Preserve existing data clocked and common clock architecture support
- Maximum reuse of HVM ingredients
  - FR4, reference clocks, etc.
- Strive for similar channel reach in high-volume topologies
  - Mobile: 8", 1 connector
  - Desktop: 14", 1 connector
  - Server: 20", 2 connectors



# **PCIe Gen3 Solution Space**



Source: Intel Corporation

- A solution space exists that satisfies 8 GT/s client and server channel requirements
  - Power, channel loss and distortion much worse at 10GT/s
  - Similar findings by PCI-SIG members corroborated Intel analysis
- PCI-SIG approved 8GT/s as PCIe 3.0 bit rate



# **Enabling Factors for 8G**

- Scrambling permits a 2x payload rate increase relative to Gen2 at an 8 GT/s data rate
  - Scrambling eliminates the 25% coding overhead of 8b/10b
  - 8G chosen over 10G due to eye margin considerations

### More capable Tx de-emphasis

- One post-cursor tap and one pre-cursor tap (2.5 GT/s and 5 GT/s have one post-cursor tap)
- Six selectable presets cover most equalization requirements
- Finer Tx equalization control available by adjusting coefficients
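As a rough illustration of what a pre-cursor plus post-cursor tap does to the transmitted waveform, the sketch below applies a 3-tap FIR filter to a short bit pattern. The function name `tx_fir` and the coefficient values are hypothetical and do not correspond to any of the six presets; the only assumption taken from common practice is that the coefficient magnitudes sum to 1.

```python
def tx_fir(bits, c_pre, c_main, c_post):
    """Apply a 3-tap FIR (pre-cursor, cursor, post-cursor) to a bit sequence."""
    symbols = [1.0 if b else -1.0 for b in bits]
    out = []
    for n, cur in enumerate(symbols):
        # Pad the ends by repeating the first/last symbol.
        nxt = symbols[n + 1] if n + 1 < len(symbols) else cur
        prv = symbols[n - 1] if n > 0 else cur
        # Pre-cursor tap weights the next symbol, post-cursor tap the previous one.
        out.append(c_pre * nxt + c_main * cur + c_post * prv)
    return out


# Hypothetical coefficients (|c_pre| + |c_main| + |c_post| = 1), not a real preset.
C_PRE, C_MAIN, C_POST = -0.1, 0.7, -0.2

levels = tx_fir([0, 0, 0, 1, 1, 1, 1, 0], C_PRE, C_MAIN, C_POST)
for bit_index, level in enumerate(levels):
    print(bit_index, round(level, 2))
```

With these values the first bit after a transition is driven at 0.8 of full swing, while bits inside a long run settle to 0.4, which is the de-emphasis effect; the bit just before a transition is boosted to 0.6 by the pre-cursor tap.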

#### Receiver equalization

- First-order linear equalization (LE) is assumed as the minimum Rx equalization
- Designs may implement more complex Rx equalization to maximize margins
- Back channel allowing Rx to select fine resolution Tx equalization settings

### BW optimizations for Tx, Rx PLLs and CDR

- PLL BW reduced, CDR (Clock Data Recovery) jitter tracking increased
- CDR BW > 10 MHz, PLL BW 2-4 MHz



### PCIe 3.0 Encoding/ Signaling




### **Problem Statement**

#### PCI Express\* (PCIe) 3.0 data rate decision: 8 GT/s

- High Volume Manufacturing channel for client/servers
  - Same channels and length for backwards compatibility
  - Low power and ease of design: avoid complicated receiver equalization, etc.

### Requirement: Double Bandwidth from Gen 2

- PCIe 1.0a data rate: 2.5 GT/s
- PCIe 2.0 data rate: 5 GT/s
  - Doubled the bandwidth from Gen 1 to Gen 2 by doubling the data rate
- Data rate gives us a 60% boost in bandwidth
- Rest will come from Encoding
  - Replace 8b/10b encoding with a scrambling-only encoding scheme when operating at PCIe 3.0 data rate
- Double B/W: encoding efficiency (1.25×) × data rate (1.6×) = 2×
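The arithmetic above can be checked with a few lines of Python. This is a sketch only: the helper name `payload_gbps` is made up for illustration, and the 128/130 ratio assumes the lane-level block framing mentioned later in the deck, which is why the computed gain comes out slightly under the idealized 2×.

```python
def payload_gbps(rate_gtps, payload_bits, total_bits):
    """Per-lane payload bandwidth in Gb/s for a given line rate and encoding."""
    return rate_gtps * payload_bits / total_bits


gen1 = payload_gbps(2.5, 8, 10)      # 8b/10b
gen2 = payload_gbps(5.0, 8, 10)      # 8b/10b
gen3 = payload_gbps(8.0, 128, 130)   # scrambling with 128/130 framing (assumed)

ideal_gain = 1.25 * 1.6              # the slide's idealized figure
actual_gain = gen3 / gen2            # includes the 2-bit sync header per block

print(f"Gen1 {gen1:.2f}  Gen2 {gen2:.2f}  Gen3 {gen3:.2f} Gb/s per lane")
print(f"ideal gain {ideal_gain:.2f}x, with 128/130 framing {actual_gain:.3f}x")
```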

Challenge: the new encoding scheme must cover 256 data values plus 12 K-codes with 8 bits



# **New Encoding Scheme**

- Two levels of encapsulation
  - Lane Level (mostly 128/130)
  - Packet Level to identify packet boundaries
    - Point to where next packet begins
- Additive Scrambling only (no 8b/10b) to provide edge density
  - Data packets scrambled: TLP/DLLP/LIDL
  - Ordered Sets mostly not scrambled
  - Electrical Idle Exit Ordered Set resets scrambler (Recovery/ Config)



Source: Intel Corporation

### Scrambling with two levels of encapsulation



# Mapping of bits on a x1 Link





# Mapping of bits on a x4 Link





# **P-Layer Encapsulation: TLP**

| Symbol   | Bits      | Field                            |
|----------|-----------|----------------------------------|
| 0        | 3:0       | STP (1111)                       |
| 0        | 7:4       | Len[3:0]                         |
| 1        | 14:8      | Len[10:4]                        |
| 1        | 15        | P                                |
| 2        | 19:16     | Seq No[11:8]                     |
| 2        | 23:20     | Frame CRC[3:0]                   |
| 3        | 31:24     | Seq No[7:0]                      |
| 4 onward | 32 and up | TLP Payload (same format as 2.0) |
| Last 4   |           | LCRC (4B, same format as 2.0)    |

[Len[10:0]: length of the TLP in DWs, Frame CRC[3:0]: check bits covering Len[10:0], P: Frame Parity, no END]

#### Length known from the first 3 Symbols

- First 4 bits are 1111 (bit[0:3] = 4'b1111)
- Bits 4:14 hold the length of the TLP (valid values: 5 to 1031)\*
- Bit 15 and bits 20:23 are check bits covering the TLP Length field
  - Primitive polynomial ($x^4 + x + 1$) protects the 15-bit field
    - Guarantees double bit flip detection (length 11 bits + CRC 4 bits)
  - Odd parity covers the 15 bits (length 11 bits + CRC 4 bits)
    - Guaranteed detection of triple bit errors (over 16 bits)
- Sequence Number occupies bits 16:19 and 24:31
- TLP payload starts at the 4th Symbol position (same as 2.0)
- No explicit END: check the 1st Symbol after the TLP for an implicit END vs. an explicit EDB, which ensures triple bit flip detection
- All Symbols are scrambled/ de-scrambled
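The bit positions listed above can be exercised with a small packing sketch. It is illustrative only: the function names are hypothetical, the CRC is a plain polynomial division by $x^4 + x + 1$ (the exact bit equations in the specification may order bits differently), and the parity sense follows this slide's "odd parity" wording, read here as making the total number of 1s across the 16 protected bits odd.

```python
CRC_POLY = 0b10011  # x^4 + x + 1


def frame_crc(length):
    """4-bit remainder of length(x) * x^4 divided by x^4 + x + 1."""
    reg = length << 4                      # 11 length bits followed by 4 zeros
    for bit in range(14, 3, -1):           # cancel bits 14 down to 4
        if reg & (1 << bit):
            reg ^= CRC_POLY << (bit - 4)
    return reg & 0xF


def frame_parity(length, crc):
    """Parity bit that makes the count of 1s in length + CRC + parity odd (assumed)."""
    ones = bin((length << 4) | crc).count("1")
    return (ones + 1) & 1


def pack_stp(length, seq):
    """Build the first four Symbols of a TLP as one 32-bit word (bit 0 first)."""
    assert 0 <= length < (1 << 11) and 0 <= seq < (1 << 12)
    crc = frame_crc(length)
    word = 0xF                                  # bits 0:3   STP marker 1111
    word |= length << 4                         # bits 4:14  Len[10:0]
    word |= frame_parity(length, crc) << 15     # bit 15     P
    word |= (seq >> 8) << 16                    # bits 16:19 Seq No[11:8]
    word |= crc << 20                           # bits 20:23 Frame CRC[3:0]
    word |= (seq & 0xFF) << 24                  # bits 24:31 Seq No[7:0]
    return word


def unpack_stp(word):
    """Return (length, seq), or None if the marker, CRC or parity check fails."""
    length = (word >> 4) & 0x7FF
    parity = (word >> 15) & 1
    crc = (word >> 20) & 0xF
    seq = (((word >> 16) & 0xF) << 8) | ((word >> 24) & 0xFF)
    if (word & 0xF) != 0xF:
        return None
    if crc != frame_crc(length) or parity != frame_parity(length, crc):
        return None
    return length, seq


token = pack_stp(length=5, seq=0xABC)
print(hex(token), unpack_stp(token))
print(unpack_stp(token ^ (1 << 6)))   # a flipped length bit is rejected
```

Because the generator polynomial is primitive with period 15, any two flipped bits inside the 15-bit length-plus-CRC field leave a nonzero remainder, and the parity bit catches every odd number of flips, which together give the triple-bit guarantee claimed on the slide.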



# **P-Layer Encapsulation: DLLP**



(DLLP Layout)

- Preserve DLLP layout of 2.0 spec
- First Symbol is F0h
- Second Symbol is ACh
- Next 4 Symbols (2 through 5) are the DLLP layout
- Next 2 Symbols (6 and 7): LCRC (identical to 2.0)
- No explicit END
- All Symbols are scrambled/ de-scrambled



## Ex: TLP/DLLP/IDLs in x8






# **TLP Transmission in a X4 Link**





### **PCIe 3.0 Protocol Extensions**



### **Protocol Extensions**



internal errors and record multiple error logs



# **TLP Processing Hints (TPH)**



# **Transaction Processing Hints**





Background:

- Small IO caches implemented in server platforms
  - Ineffective w/o info about intended use of IO data

#### Feature:

- TPH = hints on a per-transaction basis
  - Allocation & temporal reuse
- More direct CPU<->IO collaboration
  - Control structures (headers, descriptors) and data payloads

#### Benefits:

- Reduced access latencies
  - Improved data retention/allocation
- Reduced mem & QPI BW/ power
  - Avoiding data copies
- New applications
  - Comm adapters for HPC and DB clusters, Computational Accelerators,...

### Provides stronger coupling between Host Cache/Memory hierarchy and IO

### **Basic Device Writes**





# **Device Writes with TPH**



### **Basic Device Reads**





# **Device Reads with TPH**





# Atomic Operations (AtomicOps)



# **Synchronization**



#### Atomic Read-Modify-Write

- Atomic transaction support for Host update of main memory exists today
  - Useful for synchronization without interrupts
  - Rich library of proven algorithms in this area
- Benefit in extending existing interprocessor primitives for data sharing/synchronization to the PCIe interconnect domain
  - Low overhead critical sections
  - Non-Blocking algorithms for managing data structures e.g. Task lists
  - Lock-Free Statistics e.g. counter updates
- Improve existing application performance
  - Faster packet arrival rates create demand for faster synchronization
- Emerging applications benefit from Atomic RMW
  - Multiple Producer Multiple Consumer support
  - Example: Math, Visualization, Content Processing etc



PCI Express\* Device

# Atomic Read-Modify-Write (RMW)





### **Power Management Enhancements**

- Dynamic Power Allocation (DPA)
- Optimized Buffer Flush/Fill (OBFF)
- Latency Tolerance Reporting (LTR)



### **Dynamic Power Allocation**

#### Background

- PCIe 1.x provided standard Device & Link-level Power Management
- PCIe 2.0 adds mechanisms for dynamic scaling of Link width/speed
- No architected mechanism for dynamic control of device thermal/ power budgets

### Problem Statement

- Devices consume an increasing share of the system power & thermal budget
  - Emerging 300W Add-In Cards
- New Customer & Regulatory Operating Requirements
  - On-going Industry wide efforts e.g. ENERGY STAR\* Compliance
  - Battery Life/Enclosure Power Management
    - Mobile, Servers & Embedded Platforms



# **Dynamic Power Allocation (DPA)**



Enables New Platform Level Flexibility in Power/ Thermal Resource Management

# **Latency Tolerance Reporting**

Problem: Current platform PM policies guess when devices are idle (e.g. with inactivity timers)

- Guessing wrong can cause performance issues, or even HW failures
- Worst case: PM disabled to allow functionality, at a cost in power
- Even the best case is not good: reluctance to power down leaves some PM opportunities on the table
  - Tough balancing act between performance / functionality and power

Wanted: Mechanism for platform to tune PM based on <u>actual</u> device service requirements



# Latency Tolerance Reporting (LTR)



# **Optimized Buffer Flush/Fill**

Device Bus Master/Interrupt events



Wanted: Mechanism to <u>align</u> device activity with platform PM events

## **Optimized Buffer Flush/Fill (OBFF)**



### **Other Protocol Enhancements**

- ID-based Transaction Ordering
- IO Page Fault Mechanism
- Resizable BAR
- Multicast



## **Transaction Ordering Enhancement**




#### Background:

- Strong ordering == unnecessary stalls
- Transactions from different Requestors carry different IDs

#### Feature:

- New Transaction Attribute bit to indicate ID-based ordering relaxation
  - Permission to reorder transactions between different ID streams
- Applies to unrelated streams within:
  - MF Devices, Root Complex, Switches

#### Benefits:

- Improves latency/power/BW

**Reduces transaction latencies in the system.** 



# **IO Page Fault Mechanism**



#### Background:

- Emerging trend: Platform Virtualization
- Increases pressure on memory resources making page "pinning" very expensive

#### Feature:

- Built upon the PCIe Address Translation Services (ATS) mechanism
- Notify IO devices when IO page faults occur
  - Device pause/resume on page faults
  - Faulted pages requested to be made available

#### Benefits:

- OS/Hypervisor gains the ability to maintain overall system performance

**Critical for future IO Virtualization application scaling.** 



#### **Resizable BAR & Multicast**



BAR == Base Address Register: the PCI mechanism for mapping device memory into the system address space

Resizable BAR improves platform address space management and solves current problems with gfx/accelerators

Multicast provides performance scaling of existing apps (e.g. multi-Gfx) and opens new usages for PCIe in the embedded space

**Multicast Address Route** 



# Summary

- 8.0 GT/s silicon design is challenging but achievable
- Double B/W: encoding efficiency (1.25×) × data rate (1.6×) = 2×
- Next Generation PCIe Protocol Extensions Deliver
  - Energy Efficient Performance,
  - Software Model Improvements and
  - Architecture Scalability

#### Specification Status:

- Rev 0.5 spec delivered to PCI-SIG in Q1'09
- Rev 0.7 targeting Sept. '09 & Rev 0.9 early Q1'10

Continuous Improvement: Doubling Bandwidth & Improving Capabilities Every 3-4 Years!

## **Call to Action & References**

- Contribute to the evolution of PCI Express architecture
  - Review and provide feedback on PCIe 3.0 specs
  - Innovate and differentiate your products with PCIe 3.0 industry standard
- Visit:
  - <u>www.pcisig.com</u> for PCI Express specification updates
  - <u>http://download.intel.com/technology/pciexpress/devnet/docs/PCIe3\_Accelerator-Features\_WP.pdf</u> for the white paper on PCIe Accelerator Features





# Backup



#### Example of an Eye as Seen at the Receiver Input Latch



Eye margins reflect CDR tracking and Rx equalization



## Scrambling vs. 8b/10b coding

- 8 GT/s uses scrambled data to improve signaling efficiency over the 8b/10b encoding used at 2.5 GT/s and 5 GT/s, yielding 2x the payload data rate of 5 GT/s
- Unlike 8b/10b, a maximal-length PRBS generated by an LFSR does not preserve DC balance
  - The average voltage level over a fixed period of time varies slowly based on the pattern of the PRBS
  - In an AC-coupled system this creates a slowly changing differential offset that reduces eye height
- Different PRBS polynomials have different average run lengths through their pattern and so different peak differential offsets
  - There exists a best-case PRBS23 polynomial yielding a minimum DC wander of ~4.5 mV peak-to-peak: $x^{23} + x^{21} + x^{18} + x^{15} + x^7 + x^2$
- A large number of taps tends to break up long runs of 0s or 1s (a common case)
  - A pathological match between PRBS and data pattern has very low probability
  - Retry mechanism changes the polynomial starting point to prevent a pathological data pattern from failing repeatedly
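To make the idea of additive scrambling concrete, the following sketch XORs data with the output of a 23-bit LFSR and shows that running the same operation again restores the data. It is a minimal illustration under several assumptions: the tap positions are simply the exponents quoted on this slide (which may differ from the polynomial in the final specification), and the tap convention, bit order, and seed value are hypothetical.

```python
TAPS = (23, 21, 18, 15, 7, 2)   # exponents quoted on the slide (tap convention assumed)
MASK = (1 << 23) - 1


def keystream(seed, nbits):
    """Yield nbits pseudo-random bits from a 23-bit Fibonacci LFSR."""
    state = seed & MASK
    if state == 0:
        raise ValueError("an all-zero LFSR state never changes")
    for _ in range(nbits):
        feedback = 0
        for tap in TAPS:
            feedback ^= (state >> (tap - 1)) & 1
        state = ((state << 1) | feedback) & MASK
        yield feedback


def scramble(data, seed):
    """XOR data with the keystream; applying it twice with the same seed restores data."""
    stream = keystream(seed, 8 * len(data))
    out = bytearray()
    for byte in data:
        key = sum(next(stream) << i for i in range(8))   # LSB first (assumed)
        out.append(byte ^ key)
    return bytes(out)


SEED = 0x7FFFFF                  # hypothetical reset value
idle = bytes(16)                 # all zeros: no edges at all without scrambling
on_wire = scramble(idle, SEED)
recovered = scramble(on_wire, SEED)

print(on_wire.hex())
print(recovered == idle)
```

Both ends must start from the same LFSR state for the XOR to cancel, which is why the deck has the scrambler reset on each EIEOS.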



#### **Gen3 Signaling: Error Detection & Recovery**

#### Framing error is detected by the physical layer

- The first byte of a packet is not one of the allowed sets (e.g., TLP, DLLP, LIDL)
- Sync character is not 01 or 10
- Same sync character not present in all lanes after deskew
- CRC error in the length field of a TLP
- Ordered set not one of the allowed encodings or not all lanes sending the same ordered set after deskew (if applicable)
- 10 sync header received after 01 sync header without a marker packet in the 01 sync header OR received a marker packet in the 01 sync header and the subsequent sync header in any lane not 10

#### Any framing error requires directing LTSSM to Recovery

- Stop processing any received TLP/ DLLP after error until we get through Recovery
- Block lock acquired with EIEOS
- Scrambler reset with each EIEOS

#### Error Detection Guarantees

- Triple bit flip detection within each TLP/DLLP/IDL/OS



## **TLP Processing Hints (TPH)**



## **TPH Mechanism**



- Mechanism to provide processing hints on per TLP basis for Requests that target Memory Space
  - Enable system hardware (ex: Root-Complex) to optimize on a per TLP basis
  - Applicable to Memory Read/Write and Atomic Operations

| PH[1:0] | Processing Hint               | Usage Model                            |
|---------|-------------------------------|----------------------------------------|
| 00      | Bi-directional data structure | Bi-directional data structure          |
| 01      | Requestor                     | D\*D\*                                 |
| 10      | Target                        | DWHR, HWDR                             |
| 11      | Target with Priority          | DWHR (Prioritized), HWDR (Prioritized) |



# Steering Tag (ST)



Header layout for Memory Read or AtomicOperation TLPs:

|        | +0           | +1           | +2                       | +3      |
|--------|--------------|--------------|--------------------------|---------|
| Byte 0 | R, Fmt, Type | R, TC, R, TH | TD, EP, Attr, AT, Length | Length  |
| Byte 4 | Requestor ID | Requestor ID | Tag                      | ST[7:0] |

- ST: 8 bits defined in header to carry System specific Steering Tag values
  - Use of Steering Tags is optional; a 'No preference' value indicates no steering tag preference
  - Architected Steering Table for software to program system specific steering tag values



# **TPH Summary**

- Mechanism to make effective use of system fabric and improve system efficiency
  - Reduce variability in access to system memory
  - Reduce memory & system interconnect BW & power consumption
- Ecosystem Impact
  - Software impact is under investigation; at minimum, software support may be required to retrieve hints from system hardware
  - Endpoints take advantage only as needed  $\rightarrow$  No cost if not used
  - Root Complex can make implementation tradeoffs
  - Minimal impact to Switches
- Architected software discovery, identification, and control of capabilities
  - RC support for processing hints
  - Endpoint enabling to issue hints



## ID-Based Ordering (IDO)



## Review: PCIe Ordering Rules



- Maximum theoretical flexibility: All entries are "Y/ N"
- Traditional Relaxed Ordering (RO) enables A2 & D2 "Y/N" cases
  - AtomicOps ECR defines an RO-enabled C2 "Y/N" case
- ID-Based Ordering (IDO) enables A2, B2, C2, & D2 "Y/ N" cases



## **Motivation**

- RO works well for single-stream models where a data buffer is written once, consumed, and then recycled
  - Not OK for buffers that will be written more than once because writes are not guaranteed to complete in order issued
  - Does not take advantage of the fact that ordering doesn't need to be enforced between unrelated streams

#### Conventional Ordering (CO) can cause significant stalls

- Stalls of tens to hundreds of ns have been observed
- Worst case behavior may see such stalls repeatedly for a Request stream
- Consider case of NIC or disk controller with multiple streams of writes:





## IDO: Perf Optimizations for Unrelated TLP Streams



- TLP Stream: a set of TLPs that all have the same originator
- Optimizations possible for unrelated TLP Streams, notably with:
  - Multi-Function device (MFD)/ Root Port Direct Connect
  - Switched Environments
  - Multiple RC Integrated Endpoints (RCIEs)
- IDO permits passing between TLPs in different streams
- Particularly beneficial when a Translation Agent (TA) stalls TLP streams temporarily
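The passing rule described above can be sketched as a toy queue model. This is a deliberately simplified illustration, not the full ordering table: it models only posted writes, the class and function names are hypothetical, and it treats the IDO attribute on the later TLP as the permission to pass earlier TLPs that carry a different Requester ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tlp:
    requester_id: int
    ido: bool = False   # ID-based ordering attribute bit


def may_pass(later, earlier):
    """True if `later` may be reordered ahead of `earlier` (simplified model)."""
    return later.ido and later.requester_id != earlier.requester_id


def forward(queue, stalled_ids):
    """Return the TLPs that can leave the queue while some requester IDs are stalled."""
    blocked, sent = [], []
    for tlp in queue:
        stalled = tlp.requester_id in stalled_ids
        if stalled or not all(may_pass(tlp, b) for b in blocked):
            blocked.append(tlp)     # must wait behind an earlier TLP
        else:
            sent.append(tlp)
    return sent


queue = [
    Tlp(0x100, ido=True),    # stream stalled, e.g. waiting on a translation
    Tlp(0x200, ido=True),    # unrelated stream with IDO: may pass
    Tlp(0x300, ido=False),   # unrelated stream without IDO: must wait
]
print(forward(queue, stalled_ids={0x100}))
```

Without IDO every TLP behind the stalled one would wait; with IDO only the TLPs that share the stalled stream's ID, or that did not set the attribute, are held back.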



# **TLP Prefix**







- Emerging usage models require increase in header size to carry new information
  - Example: Multi-Root IOV, Extended TPH
- TLP Prefix mechanism extends the header sizes by adding DWORDs to the front of headers



## **Prefix Encoding**



- Base TLP Prefix Size 1 DW
  - Appended to TLP headers
- TLP Prefixes can be stacked or repeated
  - More than one TLP Prefix supported
- Link Local TLP Prefix: routing elements may process the TLP for routing or other purposes
  - Only usable when both ends understand and are enabled to handle link local TLP Prefix
  - ECRC not applicable
- End-End TLP Prefix
  - Requires support between the Requester, Completer and routing elements
  - End-End TLP Prefix is not required to be, but is permitted to be, protected by ECRC
    - If underlying Base TLP is protected by ECRC then End-End TLP Prefix is also protected by ECRC
  - Upper bound of 4DWORDs (16 Bytes) for End-End TLP Prefix
- Fmt field grows to 3 bits
  - New error behavior defined
  - Undefined Fmt and/or Type values result in a Malformed TLP
  - "Extended Fmt Field Supported" capability bit indicates support for 3 bit Fmt
    - Support is recommended for all components (independent of Prefix support)



## **Stacked Prefix Example:**



- Link Local is first
  - Starts at 0
  - Type<sub>L1</sub>
- End-End #1 follows Link Local
  - Starts at 4
  - Type<sub>E1</sub>
- End-End #2 follows End-End #1
  - Starts at 8
  - Type<sub>E2</sub>
- PCIe Header follows End-End #2
  - Starts at 12
- Switch routes using Link Local and PCIe Header
  - ... and possibly additional Link Local DWORDs - if more extension bits needed
  - Malformed TLP if not understood
- Switch forwards End-End Prefixes unaltered
  - End-End Prefixes do not affect routing
  - Up to 4 DWORDs (16 Bytes) of End-End Prefix
- End-End Prefixes are optional
  - Sequences of different End-End Prefixes are unordered
    - affects ECRC but does not affect meaning
  - Repeated End-End Prefix sequence must be ordered
    - e.g. 1st Extended TPH vs. 2nd Extended TPH attribute
    - meaning of this is defined by each End-End Prefix



# Multicast



# Multicast Motivation & Mechanism Basics

- Several key applications benefit from Multicast
  - Communications backplane (e.g. route table updates, support of IP Multicast)
  - Storage (e.g., mirroring, RAID)
  - Multi-headed graphics

#### PCIe architecture extended to support address-based Multicast

- New Multicast BAR to define Multicast address space
- New Multicast Capability structure to configure routing elements and Endpoints for Multicast address decode and routing
- New Multicast Overlay mechanism in Egress Ports allows Endpoints to receive Multicast TLPs without requiring an Endpoint Multicast Capability structure

#### Supports only Posted, address-routed transactions (e.g., Memory Writes)

- Supports both RCs and EPs as both targets and initiators
- Compatible with systems employing Address Translation Services (ATS) and Access Control Services (ACS)
- Multicast capability permitted at any point in a PCIe hierarchy



#### **Multicast Example**





## **Multicast Memory Space**



