

## **SpursEngine**<sup>™</sup>

#### A High-performance Stream Processor Derived from Cell/B.E.<sup>™</sup> for Media Processing Acceleration

Hiroo Hayashi, Toshiba Corporation Semiconductor Company, Advanced SoC Development Center

SpursEngine and the logo are trademarks of Toshiba Corporation in Japan, the United States and other countries. Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment Incorporated.

## Outline

- Background
- SpursEngine<sup>™</sup> Architecture Overview
- Implementation
- Programming Model
- Application and Performance
- Conclusion



### Background



#### Emergence of HD Contents

- The High Definition (HD) era is emerging quickly. For instance, video content from digital terrestrial broadcasting, digital video cameras, and optical discs is all HD-ready.

#### HD contents require much more processing power than SD contents.

- HD needs 6 times the bandwidth of Standard Definition (SD). A conventional PC architecture of CPU and GPU can decode HD video in real time, but cannot do more than that.
- Although CPUs keep getting faster, they will not be capable of encoding HD video in real time in the near future.
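
The factor of 6 is consistent with a simple pixel count. The sketch below assumes that "HD" means a 1920x1080 frame and "SD" a 720x480 frame (the resolutions used later in this deck), at the same frame rate and bit depth; the slide itself does not state how the factor was derived.

```python
# Pixel-count comparison between HD and SD frames.
# Assumption: HD = 1920x1080 and SD = 720x480, same frame rate and bit depth.
HD = (1920, 1080)
SD = (720, 480)


def pixels(resolution):
    """Return the number of pixels in a (width, height) frame."""
    width, height = resolution
    return width * height


hd_to_sd_ratio = pixels(HD) / pixels(SD)
print(f"HD pixels per frame: {pixels(HD)}")
print(f"SD pixels per frame: {pixels(SD)}")
print(f"HD/SD ratio: {hd_to_sd_ratio:.1f}")
```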





### Future Demand: Video Indexing and Searching

**Next-generation HD video solutions require** *intelligent software capabilities*.

- *Advanced and flexible algorithms* for RMS (Recognition, Mining, Synthesis) are hard to provide with hardware-only solutions.



TOSHIBA

Leading Innovation >>>

### Video data

- is *large*.
- can be efficiently *compressed*.
- is stored and distributed *in encoded format*.
- *must be decoded* before being used (processed).
- requires huge processing power.
  - Searching and matching is not easy.

The combination of encode/decode hardware and huge image-processing power is crucial.



### SpursEngine<sup>™</sup> Architecture Overview

- Media processing accelerator derived from Cell Broadband Engine<sup>™</sup> (Cell/B.E.<sup>™</sup>)
- The 4 SPEs in SpursEngine<sup>™</sup> are fully compatible with those of Cell/B.E.<sup>™</sup>
- Supports Full-HD video content efficiently
  - Hardware CODECs for standard formats are embedded in the same chip.





SPE: Synergistic Processor Element. XIO is a trademark of Rambus Inc. in the United States and other countries.

### Synergistic Processor Element (SPE)

- A simple SIMD processor core derived from Cell/B.E.<sup>™</sup>.
- Architecture focused on data processing
  - RISC-type ISA (Instruction Set Architecture)
    - High-level programming language
  - SIMD architecture
  - Large register file: 128bit x 128 entries
  - 256KB Local Storage
    - predictable latency
      - no cache miss
      - no address translation miss
    - wide data path
  - MFC: high-performance DMA engine
  - High compute density
    - Power-efficient
    - Area-efficient
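
To make the register file and Local Storage figures above more concrete, the following sketch derives a few sizes from them. The element widths (8, 16, and 32 bits) are chosen here only as typical media data types; they are an illustrative assumption, not a list taken from the slides.

```python
# Sizes derived from the SPE figures listed above.
REGISTER_BITS = 128               # width of one SIMD register
REGISTER_COUNT = 128              # number of registers in the register file
LOCAL_STORAGE_BYTES = 256 * 1024  # 256KB Local Storage


def lanes_per_register(element_bits):
    """Number of elements of the given width that fit in one 128-bit register."""
    if REGISTER_BITS % element_bits:
        raise ValueError("element width must divide the register width")
    return REGISTER_BITS // element_bits


register_bytes = REGISTER_BITS // 8
register_file_bytes = register_bytes * REGISTER_COUNT
quadwords_in_local_storage = LOCAL_STORAGE_BYTES // register_bytes

for bits in (8, 16, 32):  # illustrative element widths (assumption)
    print(f"{bits}-bit elements per register: {lanes_per_register(bits)}")
print(f"register file size: {register_file_bytes} bytes")
print(f"128-bit quadwords in Local Storage: {quadwords_in_local_storage}")
```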

PPE: Power Processor Element; SPE: Synergistic Processor Element; MFC: Memory Flow Controller; LS: Local Storage

Figure: SPE block diagram (SPU, register file, LS, MFC) and Cell/B.E.<sup>™</sup> block diagram (PPE, SPEs, Element Interconnect Bus, XDR DRAM, I/O).

## Tightly Coupled for High Bandwidth

- Video applications require much higher bandwidth than conventional applications.
- Processor elements, encoders/decoders, and high bandwidth local memories should be tightly coupled together to guarantee high data transfer bandwidth for HD video processing architecture.





## Back-End Processor for No Interference on the PC Bus

- The stream processor should sit at the back end so that it does not interfere with the CPU and GPU on the PC system bus.



## SpursEngine<sup>™</sup> Block Diagram



XDR<sup>™</sup> DRAM is a trademark of Rambus Inc. in the United States and other countries.

## **Specification Overview**

#### SPE : Synergistic Processor Element

- clock frequency: 1.5GHz
- number of SPEs: 4

#### H.264 Encoder

- High Profile@Level 4.1 support
- max. 1920x1080 60i/24p, 1280x720 60p
- clock frequency: 150MHz

#### H.264 Decoder

- High Profile@Level 4.1 support
- max. 1920x1080 60i/24p, 1280x720 60p
- clock frequency: 300MHz
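
As a sanity check, the maximum formats listed for the H.264 codecs fit within Level 4.1. The sketch below assumes the Level 4.1 limits from Table A-1 of the H.264 specification (245,760 macroblocks per second and 8,192 macroblocks per frame) and treats 60i as 30 full frames per second; these constants are not stated in the slides.

```python
import math

# Assumed H.264 Level 4.1 limits (Table A-1 of the H.264 specification).
LEVEL_4_1_MAX_MB_PER_SEC = 245_760
LEVEL_4_1_MAX_MB_PER_FRAME = 8_192

# Maximum formats from the slide: (width, height, full frames per second).
# Assumption: 60i (60 fields/s) is counted as 30 frames/s.
MODES = {
    "1920x1080 60i": (1920, 1080, 30),
    "1920x1080 24p": (1920, 1080, 24),
    "1280x720 60p": (1280, 720, 60),
}


def macroblocks_per_frame(width, height):
    """Count 16x16 macroblocks, rounding partial blocks up."""
    return math.ceil(width / 16) * math.ceil(height / 16)


def macroblock_rate(width, height, fps):
    """Macroblocks that must be processed per second."""
    return macroblocks_per_frame(width, height) * fps


def fits_level_4_1(width, height, fps):
    return (macroblocks_per_frame(width, height) <= LEVEL_4_1_MAX_MB_PER_FRAME
            and macroblock_rate(width, height, fps) <= LEVEL_4_1_MAX_MB_PER_SEC)


for name, (w, h, fps) in MODES.items():
    print(f"{name}: {macroblock_rate(w, h, fps)} MB/s, "
          f"within Level 4.1: {fits_level_4_1(w, h, fps)}")
```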

#### MPEG2 Encoder

- MP@HL support
- max. 1920x1080 60i/24p, 1280x720 60p
- clock frequency: 150MHz

#### MPEG2 Decoder

- MP@HL support
- max. 1920x1080 60i/24p, 1280x720 60p
- clock frequency: 300MHz

#### SCP : Control Processor

- Toshiba original core
- 4K I\$, 16K D\$, 24KB Instruction Memory
- clock frequency: 300MHz
- 4ch DMA

#### PCI Express Controller

- PCI Express Base Specification 1.1
- endpoint
- x4, x2, x1 link support

#### Memory Controller

- 32bit or 16bit x 1ch
- memory capacity: 64MB to 1GB
- bandwidth: 12.8GB/s @32bit data width

#### Power Management Function

- power state (D0, D1, D3hot, D3cold)
- clock gating
- clock frequency control
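
The memory controller figures above imply a per-pin data rate, which can be worked out as follows. The 16-bit result is an extrapolation that assumes the same per-pin rate as the 32-bit configuration; the slide only states the 32-bit bandwidth.

```python
# Memory bandwidth arithmetic based on the memory controller figures above.
BANDWIDTH_GB_PER_S_AT_32BIT = 12.8  # from the slide


def per_pin_gbit_per_s(bandwidth_gb_per_s, width_bits):
    """Data rate per data pin implied by a total bandwidth and bus width."""
    return bandwidth_gb_per_s * 8 / width_bits


def bandwidth_gb_per_s(pin_rate_gbit_per_s, width_bits):
    """Total bandwidth for a given per-pin rate and bus width."""
    return pin_rate_gbit_per_s * width_bits / 8


pin_rate = per_pin_gbit_per_s(BANDWIDTH_GB_PER_S_AT_32BIT, 32)
# Assumption: the 16-bit configuration runs at the same per-pin rate.
bandwidth_16bit = bandwidth_gb_per_s(pin_rate, 16)

print(f"per-pin data rate: {pin_rate:.1f} Gbit/s")
print(f"estimated bandwidth at 16-bit width: {bandwidth_16bit:.1f} GB/s")
```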

## SpursEngine<sup>™</sup> Physical Implementation



- Process:
  - 65nm bulk CMOS with 7 levels of copper layers
- Die Size:
  - 9.98mm x 10.31mm, 102.89mm<sup>2</sup>
- Fmax: 1.5GHz
- Transistor Counts:
  - 239.1M
    - Logic: 134.3M
    - SRAM: 104.8M
- TDP:
  - < 20W (depending on application sets)
- Package:
  - FC BGA 624
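
The figures in this list are mutually consistent, as the short check below shows. The transistor density and SRAM share are derived values computed here for illustration; they do not appear on the slide.

```python
# Consistency check of the physical implementation figures above.
DIE_WIDTH_MM = 9.98
DIE_HEIGHT_MM = 10.31
LOGIC_TRANSISTORS_M = 134.3
SRAM_TRANSISTORS_M = 104.8

die_area_mm2 = DIE_WIDTH_MM * DIE_HEIGHT_MM
total_transistors_m = LOGIC_TRANSISTORS_M + SRAM_TRANSISTORS_M

# Derived values (not stated on the slide).
sram_share = SRAM_TRANSISTORS_M / total_transistors_m
density_m_per_mm2 = total_transistors_m / die_area_mm2

print(f"die area: {die_area_mm2:.2f} mm^2")
print(f"total transistors: {total_transistors_m:.1f} M")
print(f"SRAM share of transistors: {sram_share:.1%}")
print(f"average density: {density_m_per_mm2:.2f} M transistors/mm^2")
```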



\* XIO™ is a trademark of Rambus Inc. in the United States and other countries.

### Backend Design Feature: Synthesizable 1.5GHz SPE

### Synthesizable design

- Shorter design TAT (turnaround time) and less manpower (estimated at 1/4 of that for a custom migration)

### Key Features

- floor plan optimization
- hybrid standard cell height
- RTL abstraction
- register retiming
- semi-custom clock tree
- wire width control



## Floor Plan Optimization

### Original Floor Plan (Cell/B.E.<sup>™</sup>)

Cell/B.E.<sup>™</sup> (65nm) 2.09mm x 5.30mm

### Square & Compact

#### more cell placement flexibility

- area:
  - 27% smaller
- total wire length:
  - 28% shorter

### **New Floor Plan**

SpursEngine<sup>™</sup> (65nm) 2.07mm x 3.93mm



L/S : Local Storage MFC : Memory Flow Controller
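
The quoted area reduction can be reproduced from the two SPE footprints given above. The aspect ratios are computed here only to illustrate the "square and compact" point and are not stated on the slides.

```python
# SPE footprint comparison using the dimensions quoted above (65nm, in mm).
ORIGINAL_SPE_MM = (2.09, 5.30)  # Cell/B.E. floor plan
NEW_SPE_MM = (2.07, 3.93)       # SpursEngine floor plan


def area(dimensions):
    width, height = dimensions
    return width * height


def aspect_ratio(dimensions):
    """Long side divided by short side; 1.0 would be a perfect square."""
    return max(dimensions) / min(dimensions)


area_reduction = 1 - area(NEW_SPE_MM) / area(ORIGINAL_SPE_MM)

print(f"original area: {area(ORIGINAL_SPE_MM):.2f} mm^2")
print(f"new area: {area(NEW_SPE_MM):.2f} mm^2")
print(f"area reduction: {area_reduction:.0%}")
print(f"aspect ratio: {aspect_ratio(ORIGINAL_SPE_MM):.2f} -> "
      f"{aspect_ratio(NEW_SPE_MM):.2f}")
```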

## Hybrid Standard Cell Height Implementation

- 12 grid height and 9 grid height standard cell libraries are used to minimize delay, area, and power.
  - 12 grid height is used for 1.5GHz design block (SPEs and EIB)
  - 9 grid height is used for other clock domains

Frequency & Area Estimation in Monolithic SC Height Implementations





\*) under the condition that path delay is kept the same by power supply voltage control



### SpursEngine<sup>™</sup> Software Programming Environment





## **Common SPU Runtime**

#### Cell/B.E.<sup>™</sup> Common SPU Runtime

- offers common API to various Cell/B.E.<sup>™</sup> systems.
- can be applied to a SpursEngine<sup>™</sup> system.
- offers scalability of SPU program
  - Multiple tasks are distributed to multiple SPEs
  - All tasks are scheduled without interaction with the host processor
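
The scheduling idea above, where workers pick up tasks from a shared queue without a central dispatcher, can be sketched in a few lines. This is a conceptual illustration only: it uses ordinary Python threads, the function name `run_tasks` is hypothetical, and it does not represent the MARS API or actual SPU code.

```python
import queue
import threading

NUM_WORKERS = 4  # matches the number of SPEs; workers here are plain threads


def run_tasks(tasks, num_workers=NUM_WORKERS):
    """Run callables on workers that schedule themselves from a shared queue.

    No central "host" thread assigns work: each worker takes the next
    pending task on its own until the queue is empty.
    """
    pending = queue.Queue()
    for index, task in enumerate(tasks):
        pending.put((index, task))
    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                index, task = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            results[index] = task()

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


# Ten small tasks spread over four workers.
squares = run_tasks([lambda n=n: n * n for n in range(10)])
print(squares)
```

Because the task list is independent of the worker count, the same call works with `num_workers=1` or `num_workers=8`, which is the scalability property the slide describes.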

#### MARS will be ready for SpursEngine<sup>™</sup>

MARS: common SPU runtime for Cell/B.E.<sup>TM</sup> designed by SONY. ftp://ftp.infradead.org/pub/Sony-PS3/mars/



### New applications by SpursEngine<sup>™</sup>

- Super-Real-Time Transcoding: Transcoding at faster than real-time
- Indexing: Video categorizing during HDD storing, DVD burning
- Gesture I/F: Control various devices in the living room by hand gestures
- Super-Resolution: Picture resolution upconverting for HDTV
- Face Tracking: Realtime 3D face tracking for communication tools
- Interactive Gaming: New type of real-time game with Gesture I/F and Face Tracking
- Editing: Video editing of consumer generated content



### Flexible Transcoding

- Combining full-encode and decode hardware with SPE software, SpursEngine<sup>™</sup> provides flexible transcoding features.
- Codec (Hardware)
  - MPEG-2 to MPEG-2, H.264 to H.264
  - MPEG-2 to H.264, H.264 to MPEG-2
- Translation type (Software)
  - To support various current and future specifications, SPE software flexibly converts resolution, frame rate, and aspect ratio during transcoding.

#### Application

| Device                     | Function                                                                                                                      |
|----------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| HD Media                   | Translates frame rate and aspect ratio while keeping the resolution, for HD media recoding.                                   |
| SD Media<br>(DVD-VR/Video) | Translates frame rate and aspect ratio while scaling the resolution down with a pixel-averaging algorithm, for DVD recoding.   |
| Mobile                     | Translates to CIF and QVGA for mobile devices.                                                                                |
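
The table mentions downscaling with a pixel-averaging algorithm without giving details. The sketch below shows the simplest form of that idea, a box filter with integer scale factors on a single plane of samples. It is an assumption-laden illustration, not the SPE implementation: a real HD-to-SD conversion such as 1920x1080 to 720x480 uses non-integer ratios and would need weighted averaging.

```python
def downscale_by_averaging(image, factor_x, factor_y):
    """Shrink an image by averaging each factor_x by factor_y block of pixels.

    `image` is a list of rows of integer samples (one plane, e.g. luma).
    Width and height must be multiples of the scale factors.
    """
    height = len(image)
    width = len(image[0])
    if height % factor_y or width % factor_x:
        raise ValueError("image size must be a multiple of the scale factors")

    block_size = factor_x * factor_y
    output = []
    for top in range(0, height, factor_y):
        row = []
        for left in range(0, width, factor_x):
            total = sum(image[y][x]
                        for y in range(top, top + factor_y)
                        for x in range(left, left + factor_x))
            # Integer average, rounded to nearest.
            row.append((total + block_size // 2) // block_size)
        output.append(row)
    return output


frame = [
    [10, 20, 30, 40],
    [30, 40, 50, 60],
    [0, 0, 100, 100],
    [0, 0, 100, 100],
]
small = downscale_by_averaging(frame, 2, 2)
print(small)
```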

### Super-Real-Time Transcoding

#### SpursEngine<sup>™</sup> performs transcoding faster than real-time.

- The numbers in the table below show how many times faster than real time SpursEngine<sup>™</sup> performs transcoding.

| Codec               | HD Media<br>1920x1080<br>to<br>1920x1080 | Digital Terrestrial<br>1440x1080<br>to<br>1440x1080 | HD Media<br>1920x1080<br>to<br>SD 720x480 | Digital Terrestrial<br>1440x1080<br>to<br>SD 720x480 | SD<br>720x480<br>to<br>720x480 |
|---------------------|------------------------------------------|-----------------------------------------------------|-------------------------------------------|------------------------------------------------------|--------------------------------|
| MPEG-2 to<br>MPEG-2 | x 2.1                                    | x 2.4                                               | x 1.8                                     | x 2.3                                                | x 8.7                          |
| MPEG-2 to<br>H.264  | x 2.0                                    | x 2.4                                               | x 1.6                                     | x 2.2                                                | x 8.2                          |
| H.264 to<br>H.264   | x 1.8                                    | x 2.3                                               | x 1.6                                     | x 2.2                                                | x 6.3                          |
| H.264 to<br>MPEG-2  | x 1.8                                    | x 2.3                                               | x 1.7                                     | x 2.2                                                | x 6.4                          |

- The numbers above are not minimum guaranteed speeds.
- They were measured on the 2008/05/e implementation and are expected to improve in the future.
- Conditions
  - Input: 29.97 fps; MPEG2 720:4.6Mbps 1440:17.4Mbps, 1920:17.7Mbps; H.264 720:2Mbps, 1440:7.9Mbps, 1920:7.9Mbps
  - Output: 29.97fps, MPEG2 720:4.8Mbps 1440:19.1Mbps, 1920:19.1Mbps; H.264 720:2Mbps, 1440:7.8Mbps, 1920:7.8Mbps
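
To make the speed factors easier to interpret, the sketch below converts a few of them into an effective frame rate and into the time needed to transcode one hour of content. Only three table cells are included, and the one-hour duration is an arbitrary example, not a measured workload.

```python
SOURCE_FPS = 29.97  # input frame rate from the measurement conditions

# A few speed factors copied from the table above (times faster than real time).
SPEED_FACTORS = {
    ("MPEG-2 to H.264", "1920x1080 to 1920x1080"): 2.0,
    ("H.264 to H.264", "1920x1080 to 1920x1080"): 1.8,
    ("MPEG-2 to H.264", "720x480 to 720x480"): 8.2,
}


def effective_fps(speed_factor):
    """Frames transcoded per second of wall-clock time."""
    return SOURCE_FPS * speed_factor


def minutes_to_transcode(content_minutes, speed_factor):
    """Wall-clock minutes needed for content of the given duration."""
    return content_minutes / speed_factor


for (codec, size), factor in SPEED_FACTORS.items():
    print(f"{codec}, {size}: {effective_fps(factor):.1f} fps, "
          f"{minutes_to_transcode(60, factor):.1f} min per hour of video")
```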

### Transcoding Performance Comparison



## **Transcoding Power-Performance Comparison**

- Power-Performance comparison on transcoding from MPEG-2 (1920x1080, 25Mbps) to H.264 (1920x1080, 15Mbps)
- ~18x Better Power-Performance

|                   | MB (W) (*1) | SpursEngine™ (W) (*2) | Total Power(W) | Time (ratio) | Power x Time (ratio) |
|-------------------|-------------|-----------------------|----------------|--------------|----------------------|
| CPU only          | 29          | 0                     | 29             | 10           | 18                   |
| CPU +SpursEngine™ | 3           | 13.3                  | 16             | 1            | 1                    |

- (\*1) Increase in total motherboard power compared to the idle state. An error of about 10% may be included due to constraints of the measurement method.
- (\*2) Total power consumption of SpursEngine<sup>™</sup>. Average of 3 typical condition samples.
- Mother Board: ASUS P5K-V, memory 2GB
- Processor: Intel<sup>®</sup> Core<sup>™</sup>2 Duo E6750 (65nm, 2.66GHz, TDP 65W)
- CPU only : Ulead VideoStudio 12®
- CPU + SpursEngine<sup>™</sup> : Original program for SpursEngine<sup>™</sup>
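
The "~18x" figure follows from multiplying power by relative processing time for each configuration. The sketch below reproduces it from the table values; the dictionary layout and names are illustrative only.

```python
# Power and relative time for each configuration, taken from the table above.
CPU_ONLY = {"board_w": 29.0, "spursengine_w": 0.0, "relative_time": 10.0}
CPU_PLUS_SPURSENGINE = {"board_w": 3.0, "spursengine_w": 13.3, "relative_time": 1.0}


def total_power_w(config):
    return config["board_w"] + config["spursengine_w"]


def relative_energy(config):
    """Power x time, in arbitrary units because time is only a ratio."""
    return total_power_w(config) * config["relative_time"]


advantage = relative_energy(CPU_ONLY) / relative_energy(CPU_PLUS_SPURSENGINE)

print(f"CPU only: {relative_energy(CPU_ONLY):.1f}")
print(f"CPU + SpursEngine: {relative_energy(CPU_PLUS_SPURSENGINE):.1f}")
print(f"power x time advantage: about {advantage:.0f}x")
```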

## Indexing (Face Navigation)

- Actors' faces in video content are recognized and classified by SpursEngine<sup>™</sup>, and listed on the screen.
- The user can play back a specific video scene by clicking an actor's face image.
- It is easy to find highlight scenes from the 'clapping and cheers graph'.
- The user can easily play back the movie from a desired scene by selecting a thumbnail.





### Face Detection Performance Comparison

- SpursEngine<sup>™</sup> can perform face detection in real time while transcoding with the H/W CODEC.



- Processing time of the face detection for each frame of a reference video image. The processing time depends on the number of faces in each input video frame.
- Transcoding is not included in this comparison.
- Intel® E6750 Core™2 Duo 2.66GHz, 2GB RAM
- Intel C++ Compiler 9.1
  - option: /QaxT /c /O3 /Ot /FD /EHsc /MT /GS /GR /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd

### Gesture Interface Remote Control

- Hand gesture recognition using an original Toshiba algorithm.



- The PC's preinstalled DVD player, digital TV tuner application, Windows® Media Center, and PowerPoint can be controlled.
- Remote control buttons suited to each application pop up on the screen.



### Gesture Interface Performance Comparison

- 2 SPEs achieve the 30 FPS processing speed required for smooth real-time gesture control.
- The host processor remains free for its own productive work.



- Mean FPS for 7 sets of input images.
- Intel® E6750 Core™2 Duo 2.66GHz, 2GB RAM, 1 core
- Intel C++ Compiler 9.1
  - option: /QaxT /GL /c /O3 /Ot /FD /EHsc /MT /GS /GR /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd

### Super-Resolution

- Convert a Standard Definition (SD) video to a sharper and clearer High Definition (HD) video using the super-resolution algorithm developed by Toshiba.
- Utilizes SpursEngine<sup>™</sup>'s processing performance to execute immense numbers of calculations needed for super-resolution.





## **One-Frame Super-Resolution**

- An upconverting method which uses similar image patterns as well as the original image patterns to estimate high resolution images.
  - Generates accurate and sharp high resolution images



## Super-Resolution Performance Comparison

#### Super-Resolution kernel code processing time

- Processing time per frame : SpursEngine<sup>™</sup> (4 SPEs) is 4.2 times faster than Core<sup>™</sup>2 Duo (2 cores).
- SpursEngine<sup>™</sup> can run this super-resolution processing *in parallel* with encoding and decoding on the H/W CODEC.
- Core<sup>™</sup>2 Duo has to execute all of these processes in turn on its two cores.
- Total Processing Time (Estimation) :
  - SpursEngine<sup>™</sup> is ~8 times faster than Core<sup>™</sup>2 Duo.

| Step                                          | Core2 Duo (msec) | SPE (msec) | Ratio |
|-----------------------------------------------|------------------|------------|-------|
| create a temporary HD frame                   | 63.3             | 12.0       | 5.3   |
| find similar patterns using self-congruency   | 137.0            | 80.6       | 1.7   |
| improve the quality of the temporary HD frame | 91.0             | 47.4       | 1.9   |
| total                                         | 291.3            | 140.1      | 2.1   |

#### Super-Resolution Kernel Code Processing Time (1 core)

- YUV420: 720x480→ 1440x1080
- Intel® E6750 Core™2 Duo 2.66GHz, 2GB RAM
- Intel C++ Compiler 9.1
  - option: /QaxT /GL /c /O3 /Ot /FD /EHsc /MT /GS /GR /Fo"Release/" /W3 /nologo /Wp64 /Zi /Gd
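
The per-step ratios in the table and the "4.2 times faster" claim can be related with a short calculation. The sketch assumes that the kernel scales linearly with the number of cores and SPEs, which the slides do not state explicitly; the table itself compares one SPE against one Core™2 Duo core.

```python
# Per-frame kernel times from the table above, in msec: (one CPU core, one SPE).
STEP_TIMES_MSEC = {
    "create a temporary HD frame": (63.3, 12.0),
    "find similar patterns using self-congruency": (137.0, 80.6),
    "improve the quality of the temporary HD frame": (91.0, 47.4),
}
TOTAL_MSEC = (291.3, 140.1)  # totals as printed in the table

NUM_SPES = 4
NUM_CPU_CORES = 2


def ratio(cpu_msec, spe_msec):
    """How many times faster one SPE is than one CPU core."""
    return cpu_msec / spe_msec


single_unit_ratio = ratio(*TOTAL_MSEC)
# Assumption: perfect linear scaling across SPEs and across CPU cores.
chip_level_ratio = single_unit_ratio * NUM_SPES / NUM_CPU_CORES

for step, (cpu_msec, spe_msec) in STEP_TIMES_MSEC.items():
    print(f"{step}: {ratio(cpu_msec, spe_msec):.1f}x")
print(f"1 SPE vs 1 core: {single_unit_ratio:.1f}x")
print(f"4 SPEs vs 2 cores: {chip_level_ratio:.1f}x")
```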



### Conclusion

### SpursEngine<sup>™</sup> has been successfully developed.

- SpursEngine<sup>™</sup> architecture is designed for intelligent HD video processing.
- It works as a back-end processor.
- It provides both flexible programmability and optimal power / performance.

### Its video processing capabilities are demonstrated.

- flexible super-real-time transcoding
- face detection processing in real-time during transcoding
- gesture control without disturbing a host processor
- fast super-resolution processing

#### Additional Information

<u>http://www.cellusersgroup.com/</u>

#### Special thanks to

- Toshiba PC company
- Toshiba R&D Center Multi-Media Lab. and Network Computing Lab.
- Toshiba Digital Media Company Core Technology Center
- Toshiba semiconductor company SpursEngine<sup>™</sup> Development team
- and so many others who contributed to SpursEngine<sup>™</sup>
