# CISCO

# The QFP Packet Processing Chip Set



Will Eatherton (Speaker), Don Steiss (Speaker), James Markevitch

Cisco Systems, Cisco Development Organization

### Agenda

- Overview of Cisco's ASR-1000 Quantum Flow Processor
- Architecture and Implementation Tradeoffs
- Software Development and Debug Environments
- Silicon Details and Design Methodology
- Results
- Summary

# **ASR 1000 Building Blocks**



- 3 chassis types: 6 / 4 / 2 RU
- RP (Route Processor): handles control plane traffic and manages the system
- ESP/FP (Forwarding Processor): handles forwarding plane traffic
- Shared Port Adapter Carrier Card: houses the SPAs
- SPAs: provide interface connectivity
- Centralized forwarding architecture: all traffic flows through the ESP/FP

# **ASR 1000 Software Architecture – IOS XE**

IOS XE is: IOSd plus IOS XE Services/APIs plus QFP Datapath Software

- IOS runs as its own Linux process for the control plane (routing, SNMP, CLI, etc.). Linux kernel with multiple processes running in protected memory.
- QFP Datapath Software
   - Fully multiprocessor code base covering a range of features across security, voice and deep packet inspection
   - Code base is ANSI C, written to run as a packet loop, with function libraries for OS-like services (memory management, timers, ...)
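The "packet loop" structure mentioned above can be pictured with a minimal C sketch. Everything here is an assumption made for illustration: `packet_t`, `verdict_t`, `process_packet` and `packet_loop` are hypothetical names, and the toy "feature" (forward IPv4, drop everything else) stands in for the real datapath features, which the slides do not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet descriptor; the real QFP structures are not shown. */
typedef struct {
    const uint8_t *data;
    size_t len;
} packet_t;

typedef enum { VERDICT_DROP, VERDICT_FORWARD } verdict_t;

/* Toy feature: forward packets that look like IPv4, drop the rest. */
verdict_t process_packet(const packet_t *pkt)
{
    assert(pkt != NULL);
    if (pkt->len < 20 || (pkt->data[0] >> 4) != 4)
        return VERDICT_DROP;
    return VERDICT_FORWARD;
}

/* Run-to-completion loop: take a packet, process it, record the verdict.
 * Returns the number of packets forwarded. */
size_t packet_loop(const packet_t *rx, size_t n, verdict_t *out)
{
    size_t forwarded = 0;
    size_t i;

    assert(out != NULL || n == 0);
    for (i = 0; i < n; i++) {
        out[i] = process_packet(&rx[i]);
        if (out[i] == VERDICT_FORWARD)
            forwarded++;
    }
    return forwarded;
}
```

In this sketch each thread would run its own copy of `packet_loop`, which is one plausible reading of "fully multiprocessor" code; the actual thread model is not described at this level in the slides.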





## **QFP10 Architecture**



## **Architecture and Implementation Tradeoffs**

#### Programming Environment

Assembly code vs. HLL: ANSI C with a typical C runtime system for portability and productivity, plus an assembly-coded Hardware Abstraction Layer (HAL).

#### Instruction Set Architecture and Implementation

ISA make vs. buy: off-the-shelf ISA to accelerate production software development.

Implementation make vs. buy: custom microarchitecture and circuits to improve power, performance and area.

#### Latency Hiding

Implicit vs. explicit mechanisms: both. Threading is implicit; non-blocking messages and data prefetch are explicit.
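As an illustration of explicit data prefetch, the sketch below issues a prefetch for a record a few iterations ahead of the one being read, so the memory access overlaps with useful work. It uses the GCC/Clang builtin `__builtin_prefetch`, not any QFP-specific instruction; `sum_counters` and `PREFETCH_DISTANCE` are hypothetical names and the distance of 4 is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 4  /* hypothetical tuning value */

/* Sum counters reached through pointers, prefetching ahead of use. */
uint64_t sum_counters(uint32_t *const *counters, size_t n)
{
    uint64_t total = 0;
    size_t i;

    assert(counters != NULL || n == 0);
    for (i = 0; i < n; i++) {
        /* Explicit hint: start fetching a later record now (read, low reuse). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(counters[i + PREFETCH_DISTANCE], 0, 1);
        total += *counters[i];
    }
    return total;
}
```

The prefetch is only a hint, so the function returns the same result with or without it; the difference would show up in stall time, which is what latency hiding targets.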

# **Architecture Tradeoffs – Continued**

#### Memory Subsystem

Ordering: intentionally weak. Loads and stores have no guaranteed MP ordering. Barriers, indivisible operations and serialization are provided. Atomic ordering facilities are provided by special resources.
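On a weakly ordered memory system, software has to place barriers wherever one processor publishes data that another consumes. The sketch below shows the general pattern with standard C11 fences; it is not the QFP's own barrier mechanism, and `mailbox_t`, `mailbox_publish` and `mailbox_poll` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t payload;   /* plain data, no ordering guarantee of its own */
    atomic_bool ready;  /* flag that publishes the payload */
} mailbox_t;

/* Producer: write the payload, then a release fence, then set the flag. */
void mailbox_publish(mailbox_t *mb, uint32_t value)
{
    assert(mb != NULL);
    mb->payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mb->ready, true, memory_order_relaxed);
}

/* Consumer: check the flag, then an acquire fence, then read the payload. */
bool mailbox_poll(mailbox_t *mb, uint32_t *out)
{
    assert(mb != NULL && out != NULL);
    if (!atomic_load_explicit(&mb->ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = mb->payload;
    return true;
}
```

Without the two fences, a weakly ordered machine would be free to make the flag visible before the payload, and the consumer could read stale data.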

D\$ parameters: enough capacity to cover parts of the working set with high locality, allocation policies to avoid cache pollution.

I\$ parameters: as large as possible, with large second-level cache bandwidth, to reduce performance interference between threads.

#### Processor and Resource Communication

Address mapping and/or message passing: message passing hardware infrastructure with an address mapped layer visible in the programming model.

# **Architecture Tradeoffs – Continued**

Accelerators: what goes in hardware?

Message passing coprocessor: performance (latency hiding), code size, encapsulation.

Scheduling: performance, encapsulation.

Crypto: performance (parallelism), stable algorithms.

Lookup/Classification/Deep Packet Inspection algorithms: performance (parallelism), stable algorithms.

# **Traffic Manager**

- 128K queues and the ability to set Max/Min/Excess BW
- Two Priority Queues can be enabled for each QoS policy applied.
- The number and makeup of layers inside the QoS policy are flexible. The possible hierarchies are not tied to any existing hierarchies used in networks today.
- The traffic manager can schedule multiple levels of hierarchy in one pass through the chip.

Queuing operations can be split across multiple passes through the chipset (for example, ingress shaping or queuing followed by egress queuing).
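To make the Max/Min/Excess bandwidth settings concrete, here is a small worked calculation in C. It is a static allocation sketch under stated assumptions, not the hardware scheduler's algorithm: each queue first receives its minimum, then leftover link bandwidth is shared in proportion to an excess weight, never exceeding the queue's maximum or its offered load. `queue_cfg_t`, `allocate_bw` and all field names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t min_bw;         /* guaranteed rate (Min) */
    uint64_t max_bw;         /* shaping ceiling (Max) */
    uint32_t excess_weight;  /* share of leftover bandwidth (Excess) */
    uint64_t demand;         /* offered load */
} queue_cfg_t;

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Most a queue can usefully receive: its offered load, capped by Max. */
static uint64_t queue_cap(const queue_cfg_t *q)
{
    return min_u64(q->demand, q->max_bw);
}

void allocate_bw(const queue_cfg_t *q, size_t n, uint64_t link_bw,
                 uint64_t *alloc)
{
    uint64_t left = link_bw;
    size_t i;

    assert(q != NULL && alloc != NULL);

    /* Pass 1: satisfy minimum guarantees while bandwidth remains. */
    for (i = 0; i < n; i++) {
        alloc[i] = min_u64(min_u64(q[i].min_bw, queue_cap(&q[i])), left);
        left -= alloc[i];
    }

    /* Pass 2: share what is left by weight until nothing more can move.
     * Assumes round * weight fits in 64 bits. */
    while (left > 0) {
        uint64_t weight_sum = 0, round = left, moved = 0;

        for (i = 0; i < n; i++)
            if (alloc[i] < queue_cap(&q[i]))
                weight_sum += q[i].excess_weight;
        if (weight_sum == 0)
            break;

        for (i = 0; i < n; i++) {
            uint64_t room = queue_cap(&q[i]) - alloc[i];
            uint64_t give =
                min_u64(round * q[i].excess_weight / weight_sum, room);
            alloc[i] += give;
            left -= give;
            moved += give;
        }
        if (moved == 0)
            break;  /* only rounding leftovers remain */
    }
}
```

For example, on a link of 1000 units with queue A (Min 100, Max 1000, weight 1) and queue B (Min 200, Max 400, weight 3), both offering 1000 units, B is capped at its Max of 400 and A receives the remaining 600.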

#### Software Code Development & Debug Environments

- Simulation, code debug and hardware debug software components integrated in the Eclipse platform:
   - Multiple speed vs. accuracy processor simulation options
   - Compiler, assembler, linker, code debugger
   - Chipset state inspection (registers, interrupt sources)
- Enabled production code development and performance tradeoffs in parallel with hardware development.
- The same environment is available for hardware analysis and debug running on the router's control plane processors.

# **QFP Code Debug Environment**

[Screenshot: Eclipse debug perspective running the `ingress_demo` simulation (suspended). Panels: Debug (cores and contexts with call stacks), Registers (including per-context coprocessor registers), Expressions (`pkt_state->switching_state`), source view of `c12000_ingress.c` showing `main()`, and the matching Disassembly view.]

### **QFP Hardware Debug Environment Example**

Remote Proxy Debug [demo-proxy – Full Control] 3/2/07 11:02 AM

| Name                         | Address    | Datum              | Full name                                         | Description                          |
|------------------------------|------------|--------------------|---------------------------------------------------|--------------------------------------|
| Physical                     |            |                    |                                                   |                                      |
| hedp                         |            |                    | HED CSR Address Map                               | Address Map of HED CPU registe…      |
| hed_halted_in_63_0_leaf_int  |            |                    | Hed Halted_in_63_0 Leaf Interrupt Register File   | This is the set of Hed Halted_in_6…  |
| hed_halted_in_127_64_leaf_int |           |                    | Hed Halted_in_127_64 Leaf Interrupt Register File | This is the set of Hed Halted_in_…   |
| hed_halt_out_63_0_leaf_int   |            |                    | Hed Halt_out_63_0 Leaf Interrupt Register File    | This is the set of Hed Halt_out_6…   |
| hed_halt_out_127_64_leaf_int |            |                    | Hed Halt_out_127_64 Leaf Interrupt Register File  | This is the set of Hed Halt_out_1…   |
| hed_elam_leaf_int            |            |                    | Hed Elam Leaf Interrupt Register File             | This is the set of Hed Elam leaf ir… |
| hed_leaf_leaf_int            |            |                    | Hed Leaf Leaf Interrupt Register File             | This is the set of Hed Leaf leaf in… |
| hed_top_hier_int             |            |                    | Hed Top Hierarchical Interrupt Register File      | This is the set of Hed Top hierar…   |
| int_stat                     | 0x24044080 | 0x0000000000000025 | Hed Top Interrupt Status Register                 | This register contains the Hed To…   |
| int_leaf                     | bit 0      | 1                  | An Interrupt Occurred in the HED's CSR Leaf       | Read the hed_leaf interrupt regis…   |
| int_halted_in_63_0           | bit 1      | 0                  | Halted 63:0 input Interrupt                       | Halted 63:0 input from top level…    |
| int_halted_in_127_64         | bit 2      | 1                  | Halted 127:64 input Interrupt                     | Halted 127:64 input from top le…     |
| int_halt_out_63_0            | bit 3      | 0                  | Halt 63:0 output Interrupt                        | Halt 63:0 output to top level mo…    |
| int_halt_out_127_64          | bit 4      | 0                  | Halt 127:64 output Interrupt                      | Halt 127:64 output to top level i…   |
| int_elam                     | bit 5      | 1                  | ELAM Interrupt                                    | ELAM Block Interrupt                 |
| int_en_rw1c                  | 0x24044088 | 0x0000000000000000 | Hed Top Interrupt Enable Register RW1C            | This register contains the interrup… |
| int_en_rw1s                  | 0x24044090 | 0x0000000000000000 | Hed Top Interrupt Enable Register RW1S            | This register contains the interru…  |
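The `int_stat` value in the screenshot, 0x25, decodes to bits 0, 2 and 5, which matches the per-bit rows shown for `int_leaf`, `int_halted_in_127_64` and `int_elam`. A minimal C sketch of that decoding follows; the bit positions come from the table, while the enum and function names are hypothetical and not part of any Cisco API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of the Hed Top Interrupt Status Register, per the table. */
enum hed_top_int {
    HED_INT_LEAF = 0,
    HED_INT_HALTED_IN_63_0 = 1,
    HED_INT_HALTED_IN_127_64 = 2,
    HED_INT_HALT_OUT_63_0 = 3,
    HED_INT_HALT_OUT_127_64 = 4,
    HED_INT_ELAM = 5,
    HED_INT_COUNT = 6
};

/* True if the given interrupt bit is set in a status value. */
bool hed_int_pending(uint64_t int_stat, enum hed_top_int bit)
{
    assert(bit < HED_INT_COUNT);
    return ((int_stat >> bit) & 1u) != 0;
}

/* Number of interrupt bits set among the six defined in the table. */
unsigned hed_int_pending_count(uint64_t int_stat)
{
    unsigned count = 0;
    int bit;

    for (bit = 0; bit < HED_INT_COUNT; bit++)
        if (hed_int_pending(int_stat, (enum hed_top_int)bit))
            count++;
    return count;
}
```

Called with 0x25, `hed_int_pending_count` reports three pending sources, and `hed_int_pending` is true only for the leaf, halted-in 127:64 and ELAM bits.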

# **QFP10 Chipset**



#### **Multi-Core Packet Processor**

- 1.2 GHz/400 MHz
- 40 custom multi-threaded CPUs
- TI 90nm, 8-layers metal
- 19.54 mm x 19.54 mm (382 mm<sup>2</sup>)
- 307 million transistors
- 20 Mb SRAM
- 1019 I/O, including 800 MHz DDR



#### **Traffic Manager and Interface Chip**

- 400 MHz
- Buffering, 200K queues, hardware HQF scheduling
- TI 90nm, 8-layers metal
- 19.0 mm x 17.48 mm (332 mm<sup>2</sup>)
- 522 million transistors
- 70 Mb SRAM
- 1318 I/O, including 800 MHz DDR

#### **Packet Processing Engine (PPE) Structure**



# Hardware Design Methodology

- Customer Owned Tooling (COT) with GDSII handoff to the foundry partner.
   - Cell library designed and characterized by Cisco.
   - SRAM compiler and I/O IP from Texas Instruments.
   - Extensive crosschecking at Cisco and Texas Instruments.
- Synthesis/P&R outside of the processor array.
   - Multiple commercial synthesis and physical design tools.
   - In-house tools for RTL code generation and documentation.

#### Hardware Design Methodology - Processor

- Cell-based to leverage ASIC tools.
- Static CMOS with selective use of domino circuits
- Schematic design entry with physical specifications in instance names for fast, deterministic placement.
- Polygon-level artwork in critical modules.
- Autorouted signals with many pre-routes above the cell level
- Multiple functional reference models and equivalence checking

# **Processor Design Verification**

- Strong block level verification methodology
- Heavy use of code generators and constraint solving
- Processor verification statistics:

   - 1,903,282 RTL simulation runs
   - 95,185,665,935 total clock cycles
   - 17,421 total test failures
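Two derived figures help put these statistics in proportion: the average simulation length and the failure rate. The short C sketch below computes both with integer arithmetic from the three numbers on the slide; the macro and function names are hypothetical, and the derived values are not stated in the original slides.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted on the slide. */
#define RTL_SIM_RUNS   1903282ULL
#define TOTAL_CLOCKS   95185665935ULL
#define TEST_FAILURES  17421ULL

/* Average simulated clock cycles per run, rounded down. */
uint64_t avg_cycles_per_run(uint64_t cycles, uint64_t runs)
{
    assert(runs != 0);
    return cycles / runs;
}

/* Test failures per million runs, rounded down. */
uint64_t failures_per_million(uint64_t failures, uint64_t runs)
{
    assert(runs != 0);
    return failures * 1000000ULL / runs;
}
```

With the slide's numbers this gives about 50,011 clock cycles per run and 9,153 failures per million runs, that is, roughly 0.9% of runs ended in a test failure.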

#### **Processor Design Verification – Continued**

Failures by Test Source



# **Results – Shmoo at Room Temp**



# **Results - Functional**

- On rev 1.0 silicon:
   - Silicon delivered for system integration, January 2007
   - First packets through production software, February 2007
   - Customer testing, August 2007

- Product Launch March 4, 2008
- All "mission mode" registers and instructions in the ISA are used in production software.



# **Summary**

- Router architecture today involves partitioning software across multiple CPU complexes, multiple cores per CPU and multiple threads per core.
- Processor architecture in networking is still evolving.
- Many architectural and implementation tradeoffs are a result of software engineering complexity that rivals hardware complexity and often requires 10x more staffing.
