# SH4 RISC Microprocessor for Multimedia

Fumio Arakawa, Osamu Nishii, Kunio Uchiyama, Norio Nakagawa (Hitachi, Ltd.)

HITACHI

HOT Chips IX in August, '97


#### Outline

- 1. SH4 Overview
- 2. New Floating-point Architecture
- 3. Length-4 Vector Instructions
- 4. 3DCG Performance
- 5. Double Precision Support
- 6. Conclusions


#### **SH4 Overview**

- Hitachi's SuperH Series Family
- For Consumer Multimedia Systems
  - Home Video Game, Handheld PC
- Excellent Performance with Consumer Price
  - 300 VAX MIPS
- Excellent 3DCG-Performance with Consumer Price
  - 5.0 M Polygons/sec\*
- IEEE 754 Standard Floating-point Architecture
  - Double-precision with Hardware Emulations

\* measured with an original simple geometry benchmark


### SH4 Specifications

| Item        | Specification                                |
|-------------|----------------------------------------------|
| Technology  | 0.25 µm CMOS, 5-Layer Metal                  |
| Voltage     | 1.8 V (I/O: 3.3 V)                           |
| Frequency   | 167 MHz (internal) / 83, 55, etc. MHz (I/O)  |
| Performance | 300 MIPS (Dhrystone), 1.17 GFLOPS (peak)     |
| Cache       | 8/16 KB (Inst./Data) Direct-mapped           |
| TLB         | 4/64-entry (Inst./Unified) Fully-associative |
| Interfaces  | SRAM, DRAM, SDRAM, burst ROM, PCMCIA         |
| Peripherals | DMAC, SCI, RTC, Timer                        |

# Pipeline Stages

- Simple Five-stage Pipelines
- Two-way Superscalar

| Pipeline       | Stage 1     | Stage 2                  | Stage 3               | Stage 4          | Stage 5                 |
|----------------|-------------|--------------------------|-----------------------|------------------|-------------------------|
| Integer        | Inst. Fetch | Inst. Dec.<br>Reg. Read  | Exec.                 |                  | Write Back              |
| Floating Point |             | Inst. Dec.<br>Reg. Read  | 1st Exec.             | 2nd Exec.        | 3rd Exec.<br>Write Back |
| Load / Store   |             | Inst. Dec.<br>Reg. Read  | Addr. Gen.            | Memory<br>Access | Write Back              |
| Branch         |             | Inst. Dec.<br>Addr. Gen. | Target<br>Inst. Fetch |                  |                         |


#### Superscalar Issue Combinations

|                      | INT | FP | LS | BR | BO | NS |
|----------------------|-----|----|----|----|----|----|
| Integer (INT)        | X   | O  | O  | O  | O  | X  |
| Floating Point (FP)  | O   | X  | O  | O  | O  | X  |
| Load / Store (LS)    | O   | O  | X  | O  | O  | X  |
| Branch (BR)          | O   | O  | O  | X  | O  | X  |
| Both INT & LS (BO)   | O   | O  | O  | O  | O  | X  |
| Not Superscalar (NS) | X   | X  | X  | X  | X  | X  |

O: the two classes can issue in the same cycle. X: they cannot.

INT: Add, Subtract, Shift, etc.

FP: Floating-point Add, Subtract, Multiply, Divide, etc.

LS: Load/Store/Transfer from/to Integer/Floating-point Register, etc.

BR: Branch Always/Conditionally, etc.

BO: Move between Integer Registers, Integer Compare, etc.

NS: Load to Control Register, etc.
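
The issue table above can be condensed into a small lookup. This is a minimal sketch, assuming the marks in the table mean "can issue together" and "cannot issue together"; the function name `can_dual_issue` is hypothetical and only models the class-level rule, not register or resource dependencies.

```python
# Instruction classes as named on the slide.
CLASSES = ("INT", "FP", "LS", "BR", "BO", "NS")


def can_dual_issue(a: str, b: str) -> bool:
    """Return True if instruction classes a and b may issue in the same cycle.

    Encodes the slide's table only; real issue logic also checks data
    dependencies, which this sketch ignores.
    """
    if a not in CLASSES or b not in CLASSES:
        raise ValueError(f"unknown instruction class: {a!r}, {b!r}")
    if "NS" in (a, b):   # NS never pairs with anything
        return False
    if "BO" in (a, b):   # BO fits either the INT or the LS slot
        return True
    return a != b        # INT, FP, LS, BR: one of each class per cycle


print(can_dual_issue("FP", "LS"))   # a vector op paired with a load
print(can_dual_issue("FP", "FP"))   # two floating-point ops
```

Per the table, the first call reports that the pair can issue together and the second that it cannot.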

# Floating-point Arch. Enhancement

- Two Sets of 16 Single Precision Registers
  - The extra set fits 4 by 4 matrix storage
- Length-4 Vector Instructions
  - Inner Product
  - Transform Vector
- Register Pair Load/Store/Transfer Instructions
  - Enough bandwidth for vector operations
- Double Precision Format Mode

#### **Floating-point Instructions**

- Common
  - FADD (Add)
  - FSUB (Subtract)
  - FMUL (Multiply)
  - FDIV (Divide)
  - FSQRT (Square Root)
  - FCMP (Compare)
  - FNEG (Negate)
  - FABS (Absolute Value)
  - FLOAT (Convert Integer to Float)
  - FTRC (Convert Float to Integer)
  - FMOV (Move from/to Register)
- Single Precision Mode Only
  - FMAC (Multiply-Accumulate)
  - FIPR (Inner Product)
  - FTRV (Transform Vector)
- Double Precision Mode Only
  - FCNVDS (Convert Double to Single)
  - FCNVSD (Convert Single to Double)

#### **Inner Product Instruction**

- 16 Registers = 4 (Length-4) Vector Registers

```
fv0 = (fr0 ,fr1 ,fr2 ,fr3 )
fv4 = (fr4 ,fr5 ,fr6 ,fr7 )
fv8 = (fr8 ,fr9 ,fr10,fr11)
fv12 = (fr12,fr13,fr14,fr15)
```

- Operation

```
frn' = (fvm, fvn)    m, n: 0, 4, 8, 12    n' = n + 3
```

- 1 cycle Pitch
- 4 cycle Latency
- Vector Normalization, Intensity Calculation, Surface Judgment
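
The register grouping and the operation above can be modelled in a few lines. This is a sketch under stated assumptions: `fipr` is a hypothetical Python stand-in for the instruction's semantics, the register file is a plain list of 16 floats, and the arithmetic is exact double-precision Python arithmetic rather than the approximate single-precision hardware datapath.

```python
def fipr(fr: list, m: int, n: int) -> None:
    """Model frn' = (fvm, fvn) with n' = n + 3.

    fr is a list of 16 values standing in for fr0..fr15.
    The result overwrites the last element of fvn.
    """
    if m not in (0, 4, 8, 12) or n not in (0, 4, 8, 12):
        raise ValueError("m and n must be one of 0, 4, 8, 12")
    fvm = fr[m:m + 4]
    fvn = fr[n:n + 4]
    fr[n + 3] = sum(a * b for a, b in zip(fvm, fvn))


# Squared length of fv0 = (1, 2, 2, 0), the first step of a normalization.
fr = [1.0, 2.0, 2.0, 0.0] + [0.0] * 12
fipr(fr, 0, 0)
print(fr[3])  # 9.0
```

Because the result lands in the fourth element of `fvn`, a 3-component vector stored with a zero fourth element leaves a natural slot for the inner product.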


# Inner Product Accuracy

- Inner Product Inst. is an Approximate Inst.
  - No Accurate Intermediate Value (the width is too wide to implement)
- More Accurate than the Worst Order Multiply and Add Inst. Combinations
- Maximum Error:
  - (Maximum Product $\times 2^{-25}$) + (Result $\times 2^{-23}$)
  - If source operands are rounded values, this is enough accuracy.
- For example: 4 Products are $2^{26}$, $-2^{26}$, 1, 0. Accurate Result = 1. Inner Product Inst. Result = 0.
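
The example above can be checked numerically. The sketch below assumes the error formula is applied to absolute values, and it emulates single-precision rounding by round-tripping through `struct`; it does not model the actual inner product datapath, only the bound and the best and worst orders of ordinary single-precision additions.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_in_order_f32(terms) -> float:
    """Add terms left to right, rounding to single precision after each add."""
    acc = 0.0
    for t in terms:
        acc = to_f32(acc + t)
    return acc


def max_error_bound(products, result: float) -> float:
    """(Maximum Product x 2^-25) + (Result x 2^-23), using absolute values."""
    return max(abs(p) for p in products) * 2.0**-25 + abs(result) * 2.0**-23


products = [2.0**26, -(2.0**26), 1.0, 0.0]
exact = sum(products)                                           # exact in double
best = sum_in_order_f32([2.0**26, -(2.0**26), 1.0, 0.0])        # cancel first
worst = sum_in_order_f32([2.0**26, 1.0, -(2.0**26), 0.0])       # 1 is absorbed
bound = max_error_bound(products, 0.0)                          # inst. result = 0

print(exact, best, worst, bound)  # 1.0 1.0 0.0 2.0
```

The instruction's result of 0 differs from the accurate result by 1, which is inside the bound of 2, and it matches what the worst ordering of separate single-precision adds would produce.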

# Inner Product vs. SIMD Multiply-Add

|                            | Inner Product | 4 Multiply-Add              |
|----------------------------|---------------|-----------------------------|
| Peak Performance           | 1             | **1** <sup>1)</sup>         |
| Latency                    | 4             | **12** <sup>2) 3)</sup>     |
| Register Port (Read/Write) | 8/1           | 12/4                        |
| Normalizer & Rounder       | 1             | 4                           |
| Floating-point Hardware    | 1             | **2** <sup>4)</sup>         |

1) The Inner Product Inst. and 4 Multiply-Adds achieve the same peak performance, which is one inner product per cycle.

2) SH4 takes 3 cycles for a Multiply-Add. $4 \times 3 = 12$.

3) Eight more cycles must be filled with independent insts. to reach peak performance with a SIMD architecture.

4) Twice as much hardware is necessary for a SIMD architecture.
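
The latency figures in notes 2) and 3) follow from simple arithmetic. A small sketch, assuming the four multiply-adds form a dependent accumulation chain (the constant names are illustrative only):

```python
FMAC_LATENCY = 3      # cycles per Multiply-Add on SH4, from note 2)
VECTOR_LENGTH = 4     # one Multiply-Add per vector element
FIPR_LATENCY = 4      # cycles for the Inner Product Inst.

# Each Multiply-Add waits for the previous partial sum.
chain_latency = VECTOR_LENGTH * FMAC_LATENCY

# The chain itself occupies 4 issue slots; the remaining cycles must be
# covered by independent instructions to keep the unit busy.
idle_slots = chain_latency - VECTOR_LENGTH

print(chain_latency, idle_slots, FIPR_LATENCY)  # 12 8 4
```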


### Elastic Pipeline

- Pipeline stages become 4 cycles for vector Inst.
- In-Order Issue, Out-of-Order Completion

| Pipeline                         |          |    |    |    |          |
|----------------------------------|----------|----|----|----|----------|
| Floating-point Pipeline (Vector) | ID<br>RR | FX | F1 | F2 | F3<br>WB |
| Load/Store Pipeline              | ID<br>RR | AG | MA | WB |          |

- Only Floating-point Non-Vector Arithmetic Inst. right after Vector Inst. is interlocked.

| Pipeline                             |          |          |          |    |          |          |
|--------------------------------------|----------|----------|----------|----|----------|----------|
| Floating-point Pipeline (Vector)     | ID<br>RR | FX       | F1       | F2 | F3<br>WB |          |
| Floating-point Pipeline (Non-Vector) | ID<br>RR | ID<br>RR | ID<br>RR | F1 | F2       | F3<br>WB |

Interlock with Resource Conflict

ID: Instruction Decode, RR: Register Read, FX,F1,F2,F3: Floating-point Execution, AG: Address Generation, MA: Memory Access, WB: Register Write Back

# **Transform Vector Instruction**

- Extra 16 Registers = 4 by 4 Matrix

matrix =  $\begin{pmatrix} xf0 & xf4 & xf8 & xf12 \\ xf1 & xf5 & xf9 & xf13 \\ xf2 & xf6 & xf10 & xf14 \\ xf3 & xf7 & xf11 & xf15 \end{pmatrix}$ 

xf: extra floating-point register

- Operation

**fvn = matrix • fvn** (n: 0, 4, 8, 12)

- 4 cycle Pitch, 7 cycle Latency
- Coordinate Transformation, Coordinate Transformation Matrix Generation
- No Work Registers
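
The matrix layout and operation above can be sketched as follows. `ftrv` is a hypothetical Python model of the semantics only: the extra registers xf0..xf15 and the registers fr0..fr15 are plain lists, matrix element (row r, column c) is read from `xf[4*c + r]` as in the matrix shown above, and the arithmetic is exact Python arithmetic rather than the hardware's.

```python
def ftrv(xf: list, fr: list, n: int) -> None:
    """Model fvn = matrix . fvn, with the matrix held in xf0..xf15."""
    if n not in (0, 4, 8, 12):
        raise ValueError("n must be one of 0, 4, 8, 12")
    v = fr[n:n + 4]  # snapshot of the source vector; not a visible register
    for r in range(4):
        fr[n + r] = sum(xf[4 * c + r] * v[c] for c in range(4))


# Translation by (10, 20, 30): identity matrix with the offsets in the
# fourth column, i.e. in xf12, xf13, xf14.
xf = [1.0, 0.0, 0.0, 0.0,
      0.0, 1.0, 0.0, 0.0,
      0.0, 0.0, 1.0, 0.0,
      10.0, 20.0, 30.0, 1.0]
fr = [1.0, 2.0, 3.0, 1.0] + [0.0] * 12
ftrv(xf, fr, 0)
print(fr[0:4])  # [11.0, 22.0, 33.0, 1.0]
```

The result replaces the source vector in place, which is what "No Work Registers" refers to from the programmer's point of view.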


### Why Transform Vector Instruction?

- Transform Vector Operation = 4 Inner Product Insts.? No!
  - Modification for Transform Vector: frn' = (xvm, fvn), m: 0,1,2,3, n: 0,4, n' = n+m+8
    - xv0 = (xf0, xf4, xf8, xf12)
    - xv1 = (xf1, xf5, xf9, xf13)
    - xv2 = (xf2, xf6, xf10, xf14)
    - xv3 = (xf3, xf7, xf11, xf15)
  - "fv8 = matrix • fv0" is divided into 4 Inner Product Insts.
    - fr8 = (xv0, fv0)
    - fr9 = (xv1, fv0)
    - fr10 = (xv2, fv0)
    - fr11 = (xv3, fv0)
  - 4 More Work Registers
  - Complicated and More Operands
  - No Generality (Just for Transform Vector)

- Transform Vector Inst. is Better.
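
The need for work registers in the inner-product decomposition can be illustrated with a short sketch. All function names here are hypothetical; the first function follows the modified form frn' = (xvm, fvn) with n' = n+m+8, and the second shows what would go wrong if the four results were written back over the source vector one at a time.

```python
def row(xf: list, m: int) -> list:
    """xvm = (xf[m], xf[m+4], xf[m+8], xf[m+12]), row m of the matrix."""
    return [xf[m + 4 * c] for c in range(4)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def transform_with_work_regs(xf: list, fr: list, n: int) -> None:
    """Four inner products; results go to work registers fr[n+8 .. n+11]."""
    if n not in (0, 4):
        raise ValueError("n must be 0 or 4")
    for m in range(4):
        fr[n + m + 8] = dot(row(xf, m), fr[n:n + 4])


def transform_in_place_naive(xf: list, fr: list, n: int) -> None:
    """Same inner products written over the source: later rows read
    elements that earlier rows have already overwritten."""
    for m in range(4):
        fr[n + m] = dot(row(xf, m), fr[n:n + 4])


# A matrix that swaps the first two components.
xf = [0.0] * 16
xf[4] = 1.0    # row 0, column 1
xf[1] = 1.0    # row 1, column 0
xf[10] = 1.0   # row 2, column 2
xf[15] = 1.0   # row 3, column 3

fr_a = [1.0, 2.0, 3.0, 4.0] + [0.0] * 12
transform_with_work_regs(xf, fr_a, 0)
work_result = fr_a[8:12]

fr_b = [1.0, 2.0, 3.0, 4.0] + [0.0] * 12
transform_in_place_naive(xf, fr_b, 0)
naive_result = fr_b[0:4]

print(work_result)   # [2.0, 1.0, 3.0, 4.0]
print(naive_result)  # [2.0, 2.0, 3.0, 4.0]
```

The decomposed version is only correct when the destination is a separate group of four registers, which is the cost the slide lists as "4 More Work Registers".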

# **Transform Vector Implementation**




### Pair Load/Store/Transfer Mode

- Normal Mode:
  - 4-bit Reg. Field represents 16 Regs. of one set.
  - Set specifier must be changed for another set access.
- Pair Mode:
  - 4-bit Reg. Field represents 16 Pair Regs. of all sets.
  - All Regs. can be accessed.
- Transform Vector Throughput: 1 vector / 4 cycles
- Load/Store Throughput: 2 vectors (4 pairs) / 4 cycles
  - Enough for Storing Previous Result Vector and Loading Next Vector during Transform Vector
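
The bandwidth claim can be checked with a little arithmetic. This sketch assumes one register-pair transfer per cycle, which is what "4 pairs / 4 cycles" implies; the constant names are illustrative.

```python
FTRV_PITCH_CYCLES = 4          # one Transform Vector every 4 cycles
REGS_PER_VECTOR = 4            # a length-4 vector
REGS_PER_PAIR = 2              # pair mode moves two registers at once
PAIR_TRANSFERS_PER_CYCLE = 1   # assumed from "4 pairs / 4 cycles"

pairs_per_vector = REGS_PER_VECTOR // REGS_PER_PAIR

# Per transform: store the previous result vector and load the next one.
needed_pairs = 2 * pairs_per_vector
available_pairs = PAIR_TRANSFERS_PER_CYCLE * FTRV_PITCH_CYCLES

print(needed_pairs, available_pairs)  # 4 4
```

With these numbers the load/store pipe is exactly saturated while the transform runs, so register traffic does not become the limit.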

#### Simple 3DCG Geometry Benchmark




#### **3DCG Geometry Performance**




# **Double Precision Support**

- Floating-point Libraries for Windows CE
  - Double Precision
  - ANSI/IEEE 754 Standard
- Emulation with Single Precision Hardware
  - Best cost-performance approach
  - Peak performance is 27.8 MFLOPS.
  - Software emulation is 20 times slower.
  - Double precision hardware is 6 times faster but costs 2.5 times more.
- New Double Precision Mode
  - Single-precision code becomes double-precision code.






# **Double Precision Multiply**

- Four multiply-adds generate product
- Sticky bit is generated from partial products
  - required width: 106 bits → 55 bits
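
One way to read the two bullets above is sketched below with Python integers. This is an illustration under stated assumptions, not the SH4 datapath: the 27-bit split point and the function name `multiply_mantissas` are hypothetical, and the sketch takes "106 bits → 55 bits" to mean that the upper 55 bits of the product are kept while everything below is folded into a single sticky bit as the partial products are accumulated.

```python
MANT_BITS = 53                 # double-precision significand incl. hidden bit
SPLIT = 27                     # hypothetical split point, not from the slides
KEEP = 55                      # accumulator width quoted on the slide
DROP = 2 * MANT_BITS - KEEP    # 51 low-order bits collapse into the sticky bit


def multiply_mantissas(a: int, b: int) -> tuple[int, int]:
    """Return (upper 55 bits of a*b, sticky bit) using four partial products."""
    if not (0 <= a < (1 << MANT_BITS) and 0 <= b < (1 << MANT_BITS)):
        raise ValueError("operands must be 53-bit mantissas")
    mask = (1 << SPLIT) - 1
    ah, al = a >> SPLIT, a & mask
    bh, bl = b >> SPLIT, b & mask

    # Step 1: low x low. Its low SPLIT bits are already final.
    acc = al * bl
    sticky = int((acc & mask) != 0)
    acc >>= SPLIT

    # Steps 2 and 3: the cross terms, aligned at bit SPLIT.
    acc += ah * bl
    acc += al * bh
    rest = DROP - SPLIT
    sticky |= int((acc & ((1 << rest) - 1)) != 0)
    acc >>= rest

    # Step 4: high x high, aligned at bit 2*SPLIT.
    acc += (ah * bh) << (2 * SPLIT - DROP)
    return acc, sticky


a = (1 << 52) | 1
b = (1 << 52) | 3
upper, sticky = multiply_mantissas(a, b)
full = a * b                   # reference: the full 106-bit product
print(upper == full >> DROP, sticky)
```

In this arrangement the accumulator never holds more than 55 bits, yet the sticky bit still records whether any discarded bit was set, which is the information rounding needs.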




# Conclusions

- Excellent Performance with Consumer Price
  - 300 VAX MIPS
- Excellent 3DCG-Performance with Consumer Price
  - New Inner Product & Vector Transformation Insts.
  - 5.0 M Polygons/sec
  - 1.17 GFLOPS (peak with the new insts.)
- IEEE 754 Standard Floating-point Architecture
  - Double-precision with Hardware Emulations
  - 27.8 MFLOPS (peak)
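
The 1.17 GFLOPS peak figure is consistent with the pitch numbers given earlier. The slides do not show the derivation, so the following is a sketch that assumes a length-4 inner product counts as 4 multiplies plus 3 adds, a 4 by 4 transform as 16 multiplies plus 12 adds, and a 167 MHz clock.

```python
CLOCK_HZ = 167e6

FIPR_FLOPS = 4 + 3      # assumed count: 4 multiplies, 3 adds
FIPR_PITCH = 1          # cycles, from the Inner Product slide

FTRV_FLOPS = 16 + 12    # assumed count: 16 multiplies, 12 adds
FTRV_PITCH = 4          # cycles, from the Transform Vector slide

fipr_rate = FIPR_FLOPS / FIPR_PITCH * CLOCK_HZ
ftrv_rate = FTRV_FLOPS / FTRV_PITCH * CLOCK_HZ

print(round(fipr_rate / 1e9, 2), round(ftrv_rate / 1e9, 2))  # 1.17 1.17
```

Both vector instructions sustain the same 7 operations per cycle under these assumptions, so either one reaches the quoted peak.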