# The ARM10 Family of Advanced Microprocessor Cores

Stephen Hill, ARM Austin Design Center





THE ARCHITECTURE FOR THE DIGITAL WORLD<sup>™</sup>

Hot Chips 13


## Agenda




#### **ARM1020E Overview**

Max frequency: 400MHz

- 0.9 V, worst case
- TSMC 0.13 µm LV

MIPS/MHz: 1.25

- 500 MIPS @ 400MHz
- Dhrystone 2.1

Active power consumption: 0.51 mA/MIPS

- Room Temp / Typical / 1.1V
- Average when running Dhrystone 2.1
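The headline figures above combine by simple multiplication. The sketch below works through that arithmetic. It assumes, purely for illustration, that the 0.51 mA/MIPS figure (typical, 1.1 V) can be applied at the 400 MHz maximum frequency, even though the slide quotes that frequency for a different corner (worst case, 0.9 V). The function names are hypothetical.

```python
# Figures quoted on the slide.
MAX_FREQ_MHZ = 400      # worst case, 0.9 V
MIPS_PER_MHZ = 1.25     # Dhrystone 2.1
MA_PER_MIPS = 0.51      # room temperature, typical process, 1.1 V
SUPPLY_V = 1.1          # supply voltage for the power figure


def dhrystone_mips(freq_mhz: float) -> float:
    """Dhrystone MIPS at a given core frequency."""
    return MIPS_PER_MHZ * freq_mhz


def active_current_ma(freq_mhz: float) -> float:
    """Average active current in mA, scaling linearly with MIPS."""
    return MA_PER_MIPS * dhrystone_mips(freq_mhz)


def active_power_mw(freq_mhz: float) -> float:
    """Average active power in mW (ASSUMPTION: 1.1 V supply at this frequency)."""
    return active_current_ma(freq_mhz) * SUPPLY_V


if __name__ == "__main__":
    print(f"{dhrystone_mips(MAX_FREQ_MHZ):.0f} MIPS")
    print(f"{active_current_ma(MAX_FREQ_MHZ):.0f} mA")
    print(f"{active_power_mw(MAX_FREQ_MHZ):.1f} mW")
```

Under these assumptions the core delivers 500 MIPS while drawing roughly 255 mA, or about 280 mW.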

Area:

- ARM1022E (2 x 16 KB): 6.9 mm<sup>2</sup>
- ARM1020E (2 x 32 KB): 10.3 mm<sup>2</sup>



#### **ARM10200 System Overview**



### **ARM10E** Microarchitecture

- 64-bit instruction and data interfaces
- Static branch prediction with branch folding
- Parallel load/store pipeline
  - Dedicated machine for LDM/STM execution; all but the first cycle of these instructions are hidden if no dependencies are encountered
- Parallel execution of multi-cycle coprocessor operations
- Multiply 16 bits per cycle
  - 1-3 cycle throughput and 2-4 cycle latency
  - No data-dependent MUL cycle counts



## **ARM7 Pipeline versus ARM10**

#### ARM7TDMI




## **ARM9 Pipeline versus ARM10**

#### **ARM9TDMI**




## **ARM1020E Memory System**

- Instruction & Data Caches
  - 32 KB instruction cache and 32 KB data cache
  - Virtually addressed, 64-way set-associative, 32-byte lines, 64-bit R/W
  - Configurable for Write Through or Write Back operation
  - Lockable by line (1/64 of the cache)
- MMUs
  - Two fully associative (I and D) 64-entry TLBs
  - Lockable by entry
  - Support for software loadable TLBs
- Write Buffer
  - Eight 64-bit entries, plus 32-byte cache line castout buffer
- AHB Bus Interface
  - 64-bit wide data transfers, split transactions
  - Multi-layer AHB support (separate I and D-side system interfaces)
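The cache parameters above fix the rest of the geometry. The sketch below derives the line count, set count, and address split for one 32 KB cache. The 32-bit virtual address width and the helper name `split_address` are assumptions for illustration. The reading of the lock granule as one way (16 lines) is an inference from "1/64 of the cache", not something the slide states.

```python
CACHE_BYTES = 32 * 1024   # one cache (instruction or data)
WAYS = 64                 # 64-way set-associative
LINE_BYTES = 32           # 32-byte lines
ADDRESS_BITS = 32         # ASSUMPTION: 32-bit virtual addresses

NUM_LINES = CACHE_BYTES // LINE_BYTES          # total lines in the cache
NUM_SETS = NUM_LINES // WAYS                   # sets selected by the index
OFFSET_BITS = LINE_BYTES.bit_length() - 1      # byte-within-line bits
INDEX_BITS = NUM_SETS.bit_length() - 1         # set-select bits
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS

# "Lockable by line (1/64 of the cache)": 1/64 of 32 KB.
LOCK_GRANULE_BYTES = CACHE_BYTES // 64


def split_address(addr: int) -> tuple[int, int, int]:
    """Split a virtual address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print(NUM_LINES, NUM_SETS, OFFSET_BITS, INDEX_BITS, TAG_BITS)
    print(LOCK_GRANULE_BYTES, "bytes per lock granule")
    print(split_address(0x12345678))
```

With these numbers the cache has 1024 lines in 16 sets, and each lock granule covers 512 bytes, which equals one line from each set.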



## **ARM1020E Memory System**

Performance features:

- Critical word first
- Non-blocking data cache
- Hit-under-miss (H-U-M)
- Data cache streaming (forwarding) from linefills
- Data cache store merging into linefills
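To illustrate critical word first, the sketch below lists the order in which the four 64-bit beats of a 32-byte linefill could arrive, starting with the doubleword that contains the missing address. The wrap-around ordering after the first beat is an assumption (it matches a wrapping four-beat burst), and `linefill_beat_order` is a hypothetical name.

```python
LINE_BYTES = 32   # cache line size from the slide
BEAT_BYTES = 8    # 64-bit data transfers


def linefill_beat_order(miss_addr: int) -> list[int]:
    """Return the beat addresses of a linefill, critical doubleword first.

    ASSUMPTION: after the critical beat, the remaining beats follow in
    ascending order and wrap at the line boundary.
    """
    line_base = miss_addr & ~(LINE_BYTES - 1)
    beats = LINE_BYTES // BEAT_BYTES
    first = (miss_addr % LINE_BYTES) // BEAT_BYTES
    return [line_base + ((first + i) % beats) * BEAT_BYTES for i in range(beats)]


if __name__ == "__main__":
    print([hex(a) for a in linefill_beat_order(0x1014)])
```

A miss at `0x1014` would fetch the doubleword at `0x1010` first, so a load waiting on that data can be forwarded (streamed) before the rest of the line arrives.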



#### **ARM1020E Interrupts**

- Interrupts taken in Execute stage
- Fast interrupt mode:
  - First load miss stops further memory ops but not other instructions
  - Limit write buffer depth
- Recommended measures for fast interrupt response:
  - Lock handler code into Caches and TLB
  - Set data cache to write-through (no cast outs)
  - Limit LDM length to 9 registers (spans only 2 cache lines)
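The 9-register limit follows from the line size. The sketch below enumerates every word-aligned starting offset within a 32-byte line and reports the worst-case number of lines a load multiple touches. It assumes 4-byte registers and word-aligned transfers. `max_lines_spanned` is a hypothetical helper.

```python
LINE_BYTES = 32   # cache line size from the slide
WORD_BYTES = 4    # ASSUMPTION: 32-bit registers, word-aligned transfers


def max_lines_spanned(num_regs: int) -> int:
    """Worst-case number of cache lines touched by an LDM of num_regs words."""
    if num_regs < 1:
        raise ValueError("num_regs must be at least 1")
    size = num_regs * WORD_BYTES
    worst = 0
    # Try every word-aligned start position within a line.
    for start in range(0, LINE_BYTES, WORD_BYTES):
        last_line = (start + size - 1) // LINE_BYTES
        worst = max(worst, last_line + 1)
    return worst


if __name__ == "__main__":
    for regs in (8, 9, 10, 16):
        print(regs, "registers ->", max_lines_spanned(regs), "lines worst case")
```

Nine registers (36 bytes) never touch more than two lines, while ten or more can touch three, so each extra line is a potential extra miss ahead of the interrupt.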



#### **ARM1020E Interrupts**

- Worst case Interrupt response time to enter handler (G:H Clock 1:1)
  - Worst case of outstanding memory operations (LDM just started)
  - 3 table walks needed (unless TLB locked down)
  - Write buffer full, bus not granted by default

| Cycles (approx.) | Fast interrupt mode | Max registers in LDM | Write-through D cache | TLB locked down |
|------------------|---------------------|----------------------|-----------------------|-----------------|
| 171              | No                  | 16                   | No                    | No              |
| 148              | Yes                 | 16                   | No                    | No              |
| 129              | Yes                 | 9                    | No                    | No              |
| 63               | Yes                 | 16                   | Yes                   | Yes             |
| 48               | Yes                 | 9                    | Yes                   | Yes             |
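To put the cycle counts in time units, the sketch below converts each row of the table to nanoseconds. The 400 MHz core clock is an assumption taken from the maximum frequency quoted earlier in the deck. The table itself only states cycles at a 1:1 clock ratio.

```python
CORE_MHZ = 400  # ASSUMPTION: core runs at the quoted maximum frequency

# (cycles, fast interrupt mode, max regs in LDM, write-through D cache, TLB locked)
WORST_CASE_ROWS = [
    (171, False, 16, False, False),
    (148, True, 16, False, False),
    (129, True, 9, False, False),
    (63, True, 16, True, True),
    (48, True, 9, True, True),
]


def cycles_to_ns(cycles: int, core_mhz: float = CORE_MHZ) -> float:
    """Convert core clock cycles to nanoseconds."""
    return cycles * 1000.0 / core_mhz


if __name__ == "__main__":
    baseline = WORST_CASE_ROWS[0][0]
    for cycles, fast, regs, write_through, tlb_locked in WORST_CASE_ROWS:
        print(f"{cycles:4d} cycles = {cycles_to_ns(cycles):6.1f} ns "
              f"({baseline / cycles:.2f}x faster than baseline)")
```

Under that assumption, applying all of the recommended measures takes the worst case from about 428 ns down to 120 ns.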



### **ARM1020E Dynamic Power**




#### **ARM10E Dynamic Power**



#### **Power Down**




#### **Power Down**



#### **VFP10**

- Fully IEEE 754 compliant (with software support)
- Performance:
  - 236 MFLOPS Linpack (SAxPY) @ 400MHz
  - 400M FIR Taps (800 Peak MFLOPS) @ 400MHz
- Functions supported in hardware
  - Multiply, add, multiply-add, subtract, multiply-subtract, negate, negate multiply, negate multiply-add, negate multiply-subtract, absolute value, compare, convert, divide, and square root
- Most IEEE 754 exceptions handled in hardware
- RunFast mode
  - No trapping enabled (Denormals flush to +0)
  - NaN fractions not propagated (not typical)
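The peak figure above is consistent with one multiply-add per cycle, counting each multiply-add as two floating-point operations (one FIR tap). The sketch below reproduces that arithmetic and expresses the Linpack result as a fraction of peak. The one-per-cycle issue rate is inferred from the quoted numbers and is not stated on the slide.

```python
FREQ_MHZ = 400           # clock frequency from the slide
MACS_PER_CYCLE = 1       # ASSUMPTION: one multiply-add issued per cycle
FLOPS_PER_MAC = 2        # one multiply plus one add
LINPACK_MFLOPS = 236     # Linpack (SAXPY) figure from the slide

fir_mtaps_per_s = FREQ_MHZ * MACS_PER_CYCLE          # millions of FIR taps/s
peak_mflops = fir_mtaps_per_s * FLOPS_PER_MAC        # peak MFLOPS
linpack_fraction_of_peak = LINPACK_MFLOPS / peak_mflops

if __name__ == "__main__":
    print(fir_mtaps_per_s, "M FIR taps/s")
    print(peak_mflops, "peak MFLOPS")
    print(f"Linpack reaches {linpack_fraction_of_peak:.1%} of peak")
```

On these numbers the Linpack result is a little under 30% of the peak multiply-add rate.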



#### **VFP10**

- 7 Stage pipeline
  - Fetch, Issue, Decode, Execute (E1), E2, E3, E4/WB
- 32 Single precision / 16 Double precision registers
- High performance short vector operations
  - Register banks operate as hardware circular queues and can be addressed as short vectors (up to 8 values)
- Separate divide/square root unit
  - Load/store and arithmetic operations can proceed in parallel with a divide or square root operation
- Separate load/store unit
  - Load/store operations may be done in parallel with data processing operations
  - 64-bit unidirectional data interfaces
- Area: ~1.6 mm<sup>2</sup> in 0.13 µm
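The circular-queue addressing of short vectors can be sketched as modular indexing within a register bank. The example below assumes the 32 single-precision registers are grouped into four banks of eight, with a vector wrapping at the end of its bank. The bank size is inferred from the "up to 8 values" limit, and the function name and stride parameter are hypothetical.

```python
NUM_SP_REGS = 32   # single-precision registers, from the slide
BANK_SIZE = 8      # ASSUMPTION: banks of eight registers (vectors of up to 8 values)


def short_vector_registers(base: int, length: int, stride: int = 1) -> list[int]:
    """Return the register numbers of a short vector starting at `base`.

    Elements advance by `stride` and wrap around within the bank that
    contains `base`, so the bank behaves like a circular queue.
    """
    if not 0 <= base < NUM_SP_REGS:
        raise ValueError("base register out of range")
    if not 1 <= length <= BANK_SIZE:
        raise ValueError("vector length must be between 1 and 8")
    bank_start = base - base % BANK_SIZE
    offset = base - bank_start
    return [bank_start + (offset + i * stride) % BANK_SIZE for i in range(length)]


if __name__ == "__main__":
    # A four-element vector starting at register 14 wraps back to register 8.
    print(short_vector_registers(14, 4))
```

Wrapping within the bank lets a loop step through successive vector windows, for example the sliding window of an FIR filter, without moving data between registers.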

#### **ETM10**

#### Embedded Trace Macrocell



#### **ETM10**

- Full real-time instruction and data tracing
- Monitors the core's *internal* buses
- Zero performance overhead
- Supports high-frequency trace with demux port
- Configurable synthesis for optimum:
  - area
  - features
  - pin count
- Programmed non-intrusively through JTAG



#### **ARM1020E Family Summary**

ARM1020E: 500 DMIPS @ 400 MHz, 0.51 mA/MIPS, 10.3 mm<sup>2</sup> (ARM1020E) / 6.9 mm<sup>2</sup> (ARM1022E)

VFP10: 236 MFLOPS @400MHz IEEE 754 Compatible

ETM10: Full speed, real time embedded trace



