

# StrongARM SA110

A 160MHz 32b 0.5W CMOS ARM Processor

Sribalan Santhanam, Digital Equipment Corporation, Hot Chips 1996







### Overview

- Highlights
- Design choices
- μArchitecture details
- Powerdown Modes
- Measured Results
- Performance Comparisons
- Summary







# **Processor Highlights**

#### Target Market Segments

- Embedded consumer applications
- PDAs, set-top boxes, Internet browsers

#### Function

- Implements ARM V4 Instruction set
- Bus compatible with ARM 610, 710, and 810

#### Performance

- Record-breaking performance/watt and price/performance
- 160MHz @ 1.65V delivers 185 Dhrystone MIPS at < 0.45W
- 215MHz @ 2.0V delivers 245 Dhrystone MIPS at < 0.9W







# **Processor Highlights**

#### Process

- 2.5 Million transistors (2.2 Million in caches)
- 3 Metal CMOS
- $t_{ox}$ of 60 Å, L<sub>EFF</sub> of 0.25 micron, and V<sub>T</sub> of 0.35V

### Packaging

- Die: 7.8mm × 6.4mm -> 50mm<sup>2</sup>
- 144 pin plastic TQFP







### **StrongARM Design Choices**

- Chose a simple design with low latency functional units to fit portable power budgets
  - Simple single issue 5 stage pipeline
  - Long tick model, low latency
  - Could have pipelined deeper for faster cycle time but would have exceeded the power budget
  - Could have gone superscalar but that would have increased control logic cost and power and increased per cycle memory interface needs
  - Either alternative would also have increased design time







### **StrongARM Design Choices**

#### Power reduction

- Run core at low voltage and I/O at standard voltage
- Scale technology
- Only run logic section needed in a cycle
  - Conditional clocking: Only drive clocks to sections running
  - Edge triggered flip flops allowed reduction in number of latches

#### Result

- Best MIPS/Watt in the industry
- A core voltage of 1.65V yields 411 MIPS/Watt @ 160MHz
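The 411 MIPS/Watt figure follows from the operating points quoted on the highlights slide. A minimal sketch of that arithmetic, assuming the quoted power upper bounds (0.45W and 0.9W) are used as the denominators; the function name is illustrative only:

```python
def mips_per_watt(dhrystone_mips: float, power_watts: float) -> float:
    """Efficiency figure of merit: Dhrystone MIPS divided by power in watts."""
    return dhrystone_mips / power_watts


# Operating points quoted in the talk (power figures are upper bounds).
low_voltage = mips_per_watt(185, 0.45)   # 160MHz @ 1.65V
high_voltage = mips_per_watt(245, 0.9)   # 215MHz @ 2.0V

print(f"1.65V core: {low_voltage:.0f} MIPS/W")
print(f"2.0V core:  {high_voltage:.0f} MIPS/W")
```

Under these assumptions the lower-voltage point is the more efficient of the two, which is consistent with the slide quoting MIPS/Watt at 1.65V.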







### **Power Reduction Factors**

**Start with Alpha 21064: 200MHz @ 3.45V, power 26W**

- VDD reduction: 4.4x power reduction, to 5.9W
- Reduce functions: 3x power reduction, to 2.0W
- Scale process: 2x power reduction, to 1.0W
- Clock load: 1.5x power reduction, to 0.6W
- Clock rate: 1.25x power reduction, to 0.5W
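The chain of reductions can be replayed numerically. The sketch below divides the 26W starting point by each listed factor in turn. It also checks two of the factors against the usual dynamic-power model P ∝ C·V²·f, which is an assumption here (the slide does not state the model), using the 1.65V core voltage and 160MHz clock quoted elsewhere in the talk.

```python
# Reduction factors as listed on the slide, applied in order.
STEPS = [
    ("VDD reduction", 4.4),
    ("Reduce functions", 3.0),
    ("Scale process", 2.0),
    ("Clock load", 1.5),
    ("Clock rate", 1.25),
]


def apply_reductions(start_watts, steps):
    """Return [(name, watts_after_step), ...], dividing by each factor in turn."""
    results = []
    watts = start_watts
    for name, factor in steps:
        watts /= factor
        results.append((name, watts))
    return results


chain = apply_reductions(26.0, STEPS)
for name, watts in chain:
    print(f"{name:18s} -> {watts:.2f} W")

# Assumed model: dynamic power scales as C * V^2 * f.
vdd_factor = (3.45 / 1.65) ** 2   # voltage term, 3.45V -> 1.65V
clock_factor = 200 / 160          # frequency term, 200MHz -> 160MHz
print(f"V^2 ratio: {vdd_factor:.2f}, clock ratio: {clock_factor:.2f}")
```

The unrounded chain ends at roughly 0.53W; the slide rounds each intermediate value, so its per-step figures differ slightly from the printed ones.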







# StrongARM µArchitecture Highlights

- Simple 5 stage pipeline
- Early branch execution
- Integer datapath with single cycle shift and add
- 5 register file ports
- High performance integer multiplier
- Split I/D caches
- Asynchronous and Synchronous bus interface







# SA110 Block Diagram





# **Basic Pipeline**

| Instruction | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|-------------|---------|---------|---------|---------|---------|---------|---------|---------|
| 100: ADDS R1, | F<br>pc<-100<br>ib<-ADDS | I<br>Read rm,rn | E<br>w<-rn+rm<br>cc<-alu.cc | B<br>w'<-w | W<br>R1<-w' | | | |
| 104: LDR R0, [R1,d]! | | F<br>pc<-104<br>ib<-LDR | I<br>Read rm,rn | E<br>w,la<-d+R1 | B<br>L<-mem(la)<br>w'<-w | W<br>R0<-L<br>R1<-w' | | |
| 108: SUB x,R0 | | | F<br>pc<-108<br>ib<-SUB | I<br>Read rm,rn | I<br>Read rm,rn | E<br>w<-R0- | B<br>w'<-w | W<br>x<-w' |
| 10C: xxx | | | | F | F | I | E | B |

Basic 5 stage pipeline:

- Instruction Fetch (F)
- Register read and Issue checks (I)
- Execute/Effective Address (E)
- Buffer and cache access (B)
- Register file Write (W)

Arrows in the original figure show data forwarding paths.
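The timing in the pipeline diagram above can be reproduced with a small scheduling model. This is a sketch under stated assumptions, not the SA110's actual issue logic: each instruction advances one stage per cycle, a stage holds one instruction at a time, ALU results forward with no stall, and an instruction that reads a loaded register cannot enter E until the cycle after the load's B stage. The `PROGRAM` tuples and the ADDS source operands are placeholders, since the slide truncates them.

```python
STAGES = "FIEBW"

# (label, register written by a load or None, registers read).
PROGRAM = [
    ("100: ADDS", None, ("Rn", "Rm")),  # source operands are placeholders
    ("104: LDR", "R0", ("R1",)),
    ("108: SUB", None, ("R0",)),
    ("10C: xxx", None, ()),
]


def schedule(program):
    """Return, per instruction, the cycle in which it enters each stage."""
    entries = []
    for i, (_, _, sources) in enumerate(program):
        enter = []
        for s in range(len(STAGES)):
            cycle = 1 if not enter else enter[-1] + 1
            if i > 0:
                prev = entries[i - 1]
                # The previous instruction must have vacated this stage.
                cycle = max(cycle, prev[s + 1] if s < 4 else prev[4] + 1)
            if s == 2:
                # Load-use interlock: wait for load data from the B stage.
                for j in range(i):
                    loaded = program[j][1]
                    if loaded is not None and loaded in sources:
                        cycle = max(cycle, entries[j][3] + 1)
            enter.append(cycle)
        entries.append(enter)
    return entries


def timeline(enter):
    """Render one instruction as a string with one stage letter per cycle."""
    row = {}
    for s, stage in enumerate(STAGES):
        end = enter[s + 1] if s < 4 else enter[s] + 1
        for c in range(enter[s], end):
            row[c] = stage
    return "".join(row.get(c, ".") for c in range(1, max(row) + 1))


rows = [timeline(e) for e in schedule(PROGRAM)]
for (label, _, _), row in zip(PROGRAM, rows):
    print(f"{label:10s} {row}")
```

In this model the SUB repeats its I stage for one cycle while waiting for the loaded R0, and the following instruction repeats F behind it, matching the `F I I E B W` and `F F I E B` rows in the diagram.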







# **Branch Example**









# Multiplier

- Performs signed and unsigned multiply and multiply-accumulate, producing 32- or 64-bit results
  - 32b\*32b->64b and 32b\*32b+64b->64b
  - 32b\*32b->32b and 32b\*32b+32b->32b
- Multiply-accumulate folds the accumulate into the multiplier array
- Multiplier array retires 12 bits of the multiplier per cycle
- Early out for short multipliers
- Multiplier adder produces 32 bits of result per cycle
- Latency of 2-4 cycles for 32 bits and 3-5 cycles for 64 bits
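The quoted latency ranges are consistent with a simple two-part model: array cycles needed to retire the multiplier at 12 bits per cycle, plus adder cycles at 32 result bits per cycle. The sketch below is an assumption chosen because it reproduces the 2-4 and 3-5 cycle ranges; the slides do not give the actual early-out rule, and this version handles unsigned multipliers only.

```python
import math


def multiply_latency(multiplier: int, result_bits: int) -> int:
    """Estimated cycles for an unsigned multiply under the assumed model."""
    if result_bits not in (32, 64):
        raise ValueError("result_bits must be 32 or 64")
    if not 0 <= multiplier < 2**32:
        raise ValueError("multiplier must fit in 32 unsigned bits")
    # Early out: only the significant bits of the multiplier are retired.
    significant = max(1, multiplier.bit_length())
    array_cycles = math.ceil(significant / 12)   # 12 multiplier bits per cycle
    adder_cycles = result_bits // 32             # 32 result bits per cycle
    return array_cycles + adder_cycles


for value in (0xFFF, 0x1000, 0xFFFFFFFF):
    print(hex(value), multiply_latency(value, 32), multiply_latency(value, 64))
```

Under this model a multiplier of 12 bits or fewer finishes in the minimum time, and a full 32-bit multiplier takes the maximum.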







### **StrongARM Caches**

#### Separate 16KB I and D caches

- 32-byte blocks, 32-way set associative
- Dcache is writeback with no write allocation
- Cache tags are virtual addresses
- Dcache stores physical address with cache line
- Caches occupy half of total chip area
- Self-timed to save power
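The cache parameters above fix the number of sets and the split of an address into tag, index, and offset. A minimal sketch of that arithmetic, assuming a 32-bit virtual address and power-of-two sizes; the function and field names are illustrative:

```python
def cache_geometry(size_bytes, block_bytes, ways, address_bits=32):
    """Derive line count, set count, and address field widths for a cache."""
    lines = size_bytes // block_bytes
    sets = lines // ways
    offset_bits = (block_bytes - 1).bit_length()
    index_bits = (sets - 1).bit_length()
    tag_bits = address_bits - offset_bits - index_bits
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


geometry = cache_geometry(16 * 1024, 32, 32)
print(geometry)
```

With 16KB, 32-byte blocks, and 32 ways, each cache has 512 lines arranged as 16 sets, so only 4 address bits select the set and the remaining upper bits form the tag.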







# StrongARM Cache design tradeoffs

#### Why a 32-way associative cache?

- Wanted associativity of at least 4-way
- Wanted to minimize power, so the cache was divided into 8 banks so that only one eighth of the cache was enabled per access
- Required single cycle access for reads and writes
- Our implementation provided 32-way associativity for free while meeting the above criteria

### Writes are done in a single cycle, same as reads

- Removes need for buffer and tag for read after write hazard
- Read done before write
- Memory management exceptions write back original data





### **MMUs and Write Buffer**

#### Memory Management Units

- Separate I and D MMUs, each with 32 fully associative entries which can map 4KB, 64KB, or 1MB pages
- ARM architecture includes extensions to memory management protection for efficient support of object oriented systems.
  - Additional checks must be performed in series with TLB lookup
  - Self timing required to perform lookup and protection checks in one cycle
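A fully associative TLB with mixed page sizes compares the virtual address against every entry, masking off the page offset according to each entry's own page size. The following is a software sketch of that idea only. The class names, the FIFO replacement policy, and the omission of protection checks are all assumptions; the slides do not describe the SA110's replacement policy or entry format.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SIZES = (4 * 1024, 64 * 1024, 1024 * 1024)  # 4KB, 64KB, 1MB


@dataclass
class TlbEntry:
    virtual_base: int
    physical_base: int
    page_size: int


class FullyAssociativeTlb:
    def __init__(self, capacity: int = 32):
        self.capacity = capacity
        self.entries = []

    def insert(self, entry: TlbEntry) -> None:
        if entry.page_size not in PAGE_SIZES:
            raise ValueError("unsupported page size")
        if entry.virtual_base % entry.page_size or entry.physical_base % entry.page_size:
            raise ValueError("bases must be page aligned")
        if len(self.entries) == self.capacity:
            self.entries.pop(0)  # hypothetical FIFO replacement
        self.entries.append(entry)

    def translate(self, vaddr: int) -> Optional[int]:
        """Return the physical address, or None on a TLB miss."""
        for entry in self.entries:
            offset_mask = entry.page_size - 1
            if (vaddr & ~offset_mask) == entry.virtual_base:
                return entry.physical_base | (vaddr & offset_mask)
        return None


tlb = FullyAssociativeTlb()
tlb.insert(TlbEntry(0x00100000, 0x40000000, 1024 * 1024))
tlb.insert(TlbEntry(0x00008000, 0x00002000, 4 * 1024))
print(hex(tlb.translate(0x00123456)), hex(tlb.translate(0x00008ABC)))
```

With 32 entries, the address range covered without a miss runs from 128KB (all 4KB pages) to 32MB (all 1MB pages).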

#### Write Buffer

- Eight 16-byte entries with a single-entry merge buffer







### Other StrongARM SA110 Features

- Shift + ALU operation in a single cycle
  - Provides low latency with simple control logic
- Shifter bypassed for shifts by zero
  - Provides power savings when shifter not needed
- MOV PC, Rx executed in Issue stage
  - Provides low latency for subroutine returns







# StrongARM SA110 System Interface

#### Clocking

- 3.68 MHz input clock multiplied by on-chip low power PLL
  - Generates 16 frequencies from 88MHz to 287MHz
  - Power dissipation = 1.5mW
- System interface clock can
  - Either be driven by the core at 1/2 to 1/9 of the core frequency
  - Or be driven by the system at any frequency up to 66MHz
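The clocking figures can be explored with a little arithmetic. The sketch below computes the ratio of a core frequency to the 3.68MHz input clock and lists the core-derived system clock for each divisor from 2 to 9. Treating the PLL ratios as integers is an assumption suggested by the endpoints (about 24x and 78x); the slides do not list the 16 frequencies.

```python
INPUT_CLOCK_MHZ = 3.68


def implied_multiplier(core_mhz: float) -> float:
    """Ratio of core frequency to the input clock."""
    return core_mhz / INPUT_CLOCK_MHZ


def bus_clock_options(core_mhz: float) -> dict:
    """Core-derived system clock for each divisor from 1/2 to 1/9."""
    return {divisor: core_mhz / divisor for divisor in range(2, 10)}


print(f"88MHz  -> x{implied_multiplier(88):.2f}")
print(f"287MHz -> x{implied_multiplier(287):.2f}")

options = bus_clock_options(160)
for divisor, mhz in options.items():
    print(f"160MHz / {divisor} = {mhz:.1f}MHz")
```

At a 160MHz core clock, for example, the 1/3 divisor gives a system clock of about 53MHz.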

### System interface compatible with existing ARM parts

- Separate 32-bit address and data bus
- Enhancements to the current ARM bus provide for wrapped reads to return the critical word first and for general write merging







### **Powerdown Modes**

#### Idle Mode

- For short periods of inactivity with quick restart
- Clock trees and all local clocks stop
- PLL continues to run
- Power consumption limited to 20mW

### Sleep Mode

- For extended periods of inactivity
- Core power supply turned off
- I/O remains powered and maintains bus state
- Standby current limited to 50μA







### **Measured Results**

- Running Dhrystone on StrongARM in evaluation board. MCLK frequency = 1/3 PLL frequency.
- Dhrystone fits in cache so internal clocks are running at full speed.
- Measurements taken for I/O Vddx = 3.3V and core Vdd = 1.65V and 2.0V
  - For Vdd = 1.65V, Total power = 2.54mW/MHz -> 254mW @ 100MHz & 406mW @ 160MHz
  - For Vdd = 2.0V, Total power = 3.3mW/MHz -> 528mW @ 160MHz & 710mW @ 200MHz
- Typical power much less on more realistic applications
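The measured figures are stated as a per-MHz slope, so total power at a given clock is a single multiplication. A minimal sketch, assuming power is linear in frequency at a fixed core voltage, as the mW/MHz figures imply; the names are illustrative:

```python
# Measured slopes from the slide, keyed by core Vdd in volts.
MW_PER_MHZ = {1.65: 2.54, 2.0: 3.3}


def total_power_mw(core_vdd: float, freq_mhz: float) -> float:
    """Total power in mW at the given core voltage and clock frequency."""
    return MW_PER_MHZ[core_vdd] * freq_mhz


for vdd, mhz in [(1.65, 100), (1.65, 160), (2.0, 160), (2.0, 200)]:
    print(f"Vdd={vdd}V @ {mhz}MHz: {total_power_mw(vdd, mhz):.0f} mW")

# Frequency at which the 2.0V slope reaches the quoted 710mW.
freq_for_710 = 710 / MW_PER_MHZ[2.0]
print(f"710mW at 2.0V corresponds to about {freq_for_710:.0f}MHz")
```

At 3.3mW/MHz, 710mW corresponds to roughly 215MHz, the 2.0V operating point quoted elsewhere in the talk; a 200MHz clock would give 660mW under the same linear model.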







### Simulated Power Breakdown

| Unit         | Share of power |
|--------------|----------------|
| ICACHE       | 27%            |
| IBOX         | 18%            |
| DCACHE       | 16%            |
| CLOCK        | 10%            |
| IMMU         | 9%             |
| EBOX         | 8%             |
| DMMU         | 8%             |
| WRITE BUFFER | 2%             |
| BIU          | 2%             |
| PLL          | < 1%           |







# StrongARM SA110 Die photo









# **Performance Comparison**









### **StrongARM SA110 Summary**

#### Function

- Implements ARM Version 4 instruction set
- Bus Compatible with ARM 610, 710, and 810

#### Performance

- Best performance/watt and price/performance
- 160MHz @ 1.65V -> 185 Dhrystone MIPS at < 0.45W
- 215MHz @ 2.0V -> 245 Dhrystone MIPS at < 0.9W

### Process and Package Technology

- 2.5 million transistors fabricated in 0.35 $\mu m$ 3 metal CMOS with 0.35V $V_T$ and 0.25 $\mu m$ $L_{EFF}$
- Die size: 50mm² in a 144 pin plastic TQFP



