From Model to FPGA: Software-Hardware Co-Design for Efficient Neural Network Acceleration

Kaiyuan Guo<sup>1,2</sup>, Lingzhi Sui<sup>1</sup>, Jiantao Qiu<sup>2</sup>, <u>Song Yao<sup>1</sup></u>, Song Han<sup>1,3</sup>, Yu Wang<sup>1,2</sup>, Huazhong Yang<sup>1</sup> <sup>1</sup> DeePhi Technology <sup>2</sup> Tsinghua University, <sup>3</sup> Stanford University

Acknowledgement: Dongliang Xie and DeePhi Engineering Team









## DeePhi Tech

- Founded by Song Yao, Yu Wang, and Song Han in March 2016
- FPGA-based solution provider for deep learning
  - Automatic compilation tool chain + mini board/IP
  - Architecture for CNN and RNN-LSTM

### About DeePhi Tech

# Discovering the philosophy behind deep learning computing

Compression Compilation



# Supporting detection, tracking, object/speech recognition, translation, etc.







- New Platform Expected for Deep Learning
- Trend in Neural Network Design
- Platform Selection
- Overall Flow
- Model Compression: Useful in Real-World Networks
- Activation Quantization: 8 Bits Are Enough
- Aristotle: Architecture for CNN Acceleration
- Descartes: Architecture for Sparse LSTM Acceleration
- Conclusion







### New Platform Expected for Deep Learning

- Drone (Client)
  - Requirement: **Real-time object recognition**
  - Limitation: **Battery capacity**
- Video Surveillance (Edge)
  - Requirement: **Real-time video analysis**
  - Limitation: High maintenance cost
- Speech Recognition (Cloud)
  - Requirement: Low latency
  - Limitation: High maintenance/cooling cost

A low-power, high-performance platform for deep learning is urgently needed



#### **CNN for Object Recognition**



Source: Ross Girshick, "Fast R-CNN"



Source: Ross Girshick et al., "R-CNN"

## Trend in Neural Network Design

**RNN-LSTM** for Speech Recognition





Source: Hasim Sak et al., "Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition"

### Frameworks for different applications have not been unified









## Trend in Neural Network Design







Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

- Larger model size, higher bandwidth requirement


### Trend in Neural Network Design



An RNN-LSTM accelerator should overcome the bandwidth problem



### FPGA is good for inference applications

- CPU: Not enough energy efficiency
- DSP: Not enough performance with high cache miss rate
- ASIC has high NRE: No clear huge market yet
- ASIC has long time-to-market but neural networks are still evolving
- FPGA
  - Acceptable power and performance
  - Supports customized architecture
  - High on-chip memory bandwidth
  - Relatively short time to market
  - High reliability

### FPGA-based deep learning accelerators meet most products' requirements

### **Platform Selection**

GPU: Extremely efficient in training, not enough efficiency in inference (batch size = 1)



### Software-Hardware Co-Design is Necessary

- Great redundancy in neural networks
  - VGG16 network can be compressed from 550MB to 11.3MB
  - FPGA has limited BRAM and DDR bandwidth
- Different neural networks have different computation patterns
  - CNN: Frequent data reuse, dense
  - DNN/RNN/LSTM: No data reuse, sparse
  - Different architectures must adapt to different neural networks
- Neural networks are still evolving
  - Architecture must adapt to new algorithms
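The difference in data reuse between CNN and FC/LSTM layers can be illustrated by counting how many operations each weight contributes once it has been fetched. This is a rough sketch, assuming one multiply-accumulate counts as two operations; the function names and the 56x56 output size are hypothetical and chosen only for illustration.

```python
def conv_ops_per_weight(out_h: int, out_w: int) -> int:
    """A convolution weight is reused at every output position (1 MAC = 2 ops)."""
    return 2 * out_h * out_w


def fc_ops_per_weight(batch: int = 1) -> int:
    """An FC/LSTM weight is used once per input vector in the batch."""
    return 2 * batch


# Assumed 56x56 output feature map for the convolution case.
print(conv_ops_per_weight(56, 56))  # 6272
print(fc_ops_per_weight())          # 2
```

With batch size 1, each FC or LSTM weight does very little work per fetch, which is why those layers tend to be limited by memory bandwidth rather than by compute.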



### **Platform Selection**

### Limitations of FPGA platform

- Limited BRAM size
- Limited DDR bandwidth





Algorithm engineers can simply run the compiler tool to implement FPGA acceleration


### Traditional FPGA-based Acceleration Faced Two Major Problems

- Long development period
  - Hand coded: 2-3 months
  - OpenCL and HLS: 1 month
- Insufficient performance and energy efficiency

### DeePhi's workflow solves the two problems in FPGA acceleration

- Compiler + Architecture instead of OpenCL
  - Algorithm designers need to know nothing about hardware
  - Generates instructions instead of RTL code
  - Compilation in 1 minute
- Much higher performance and energy efficiency
  - Hand-coded IP core and efficient architecture design
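To illustrate what "generates instructions instead of RTL code" can mean, here is a toy sketch of a compiler pass that turns a layer list into a flat instruction stream for a fixed accelerator. The `Layer` type and the `LOAD`, `CONV`, `POOL`, and `SAVE` opcodes are hypothetical names invented for this example; they are not DeePhi's instruction set.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    kind: str  # "conv" or "pool" in this toy example
    name: str


def compile_network(layers):
    """Translate layers into (opcode, operand) tuples; opcodes are hypothetical."""
    program = [("LOAD", "input")]
    for layer in layers:
        if layer.kind == "conv":
            # Fetch the layer's weights, then run the convolution.
            program.append(("LOAD", f"{layer.name}.weights"))
            program.append(("CONV", layer.name))
        elif layer.kind == "pool":
            program.append(("POOL", layer.name))
        else:
            raise ValueError(f"unsupported layer kind: {layer.kind}")
    program.append(("SAVE", "output"))
    return program


network = [Layer("conv", "conv1"), Layer("pool", "pool1")]
for instruction in compile_network(network):
    print(instruction)
```

Because the hardware stays fixed and only the instruction stream changes, a new network does not require a new synthesis run, which is consistent with the short compilation time listed above.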

### **Overall Flow**



## Model Compression: Useful in Real-World Networks

Deep Compression: Useful for RNN-LSTM and FC layers in CNN

#### Small DNN models are critical.





#### pruning

weight sharing

| Network    | Original Size | Compressed Size | Compression Ratio | Original Accuracy | Compressed Accuracy |
|------------|---------------|-----------------|-------------------|-------------------|---------------------|
| AlexNet    | 240MB         | 6.9MB           | 35x               | 80.27%            | 80.30%              |
| VGGNet     | 550MB         | 11.3MB          | 49x               | 88.68%            | 89.09%              |
| GoogLeNet  | 28MB          | 2.8MB           | 10x               | 88.90%            | 88.92%              |
| SqueezeNet | 4.8MB         | 0.47MB          | 10x               | 80.32%            | 80.35%              |

Source: Song Han et al., Stanford University

With re-training, we can achieve:

- < 10% sparsity for real-world FC layers in CNN
- ~ 15% sparsity for real-world LSTMs
- 4 bit weight quantization with no accuracy loss

Deep Compression is useful in real-world neural networks and can save a great deal of computations and bandwidth demands
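The two steps named above, pruning and weight sharing, can be sketched in a few lines. This is a simplified illustration, not the Deep Compression implementation: it prunes by magnitude without re-training, and it uses an evenly spaced 16-entry codebook (4-bit indices) where the published method uses k-means clustering. The function names and sample weights are hypothetical.

```python
def prune(weights, keep_ratio):
    """Keep the largest-magnitude weights and zero the rest (ties may keep extra)."""
    k = max(1, round(len(weights) * keep_ratio))
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def share_weights(weights, bits=4):
    """Map each non-zero weight to the nearest entry of an evenly spaced codebook."""
    nonzero = [w for w in weights if w != 0.0]
    if not nonzero:
        return [], [None] * len(weights)
    lo, hi = min(nonzero), max(nonzero)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1) or 1.0  # avoid a zero step when lo == hi
    codebook = [lo + i * step for i in range(levels)]
    # Pruned weights carry no index; the others store a small integer.
    indices = [None if w == 0.0 else round((w - lo) / step) for w in weights]
    return codebook, indices


weights = [0.9, -0.05, 0.02, -0.7, 0.01, 0.4, -0.03, 0.0, 0.08, -0.6]
pruned = prune(weights, keep_ratio=0.3)
codebook, indices = share_weights(pruned, bits=4)
print(pruned)
print(indices)
```

After both steps, each surviving weight is stored as a 4-bit index plus its position, instead of a 32-bit value.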






#### Image classification on ILSVRC 2012

| Network    | Metric | FP32 (original) | FIXED-16 (raw) | FIXED-16 (re-train) | FIXED-8 (raw) | FIXED-8 (re-train) |
|------------|--------|-----------------|----------------|---------------------|---------------|--------------------|
| VGG16      | Top-1  | 65.77%          | 65.78%         | 67.84%              | 65.58%        | 67.72%             |
| VGG16      | Top-5  | 86.64%          | 86.65%         | 88.19%              | 86.38%        | 88.06%             |
| GoogLeNet  | Top-1  | 68.60%          | 68.70%         | 68.70%              | 62.75%        | 62.75%             |
| GoogLeNet  | Top-5  | 88.65%          | 88.45%         | 88.45%              | 85.70%        | 85.70%             |
| SqueezeNet | Top-1  | 58.69%          | 58.69%         | 58.69%              | 57.27%        | 57.27%             |
| SqueezeNet | Top-5  | 81.37%          | 81.35%         | 81.36%              | 80.32%        | 80.35%             |

Object detection on PASCAL VOC 2007

- R-FCN: < 2% mAP loss without re-training using 8-bit quantization
- YOLO: < 1% mAP loss without re-training using 8-bit quantization
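For readers unfamiliar with the FIXED-8 format in these results, the following is a minimal sketch of signed 8-bit fixed-point quantization with saturation. The split into 4 fractional bits and the function name are assumptions for illustration; the slides do not state the exact format used.

```python
def quantize_fixed(x: float, total_bits: int = 8, frac_bits: int = 4) -> float:
    """Round x to a signed fixed-point grid and saturate at the range limits."""
    scale = 2 ** frac_bits
    qmin = -(2 ** (total_bits - 1))      # -128 for 8 bits
    qmax = 2 ** (total_bits - 1) - 1     # +127 for 8 bits
    # Python's round() rounds halves to the nearest even integer.
    q = max(qmin, min(qmax, round(x * scale)))
    return q / scale


print(quantize_fixed(1.30))    # 1.3125 (nearest multiple of 1/16)
print(quantize_fixed(100.0))   # 7.9375 (saturated at 127/16)
```

Values inside the representable range lose at most half a step of precision, while values outside it are clipped, so the choice of fractional bits matters for accuracy.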

### Activation Quantization: 8 Bits Are Enough







#### Image classification: Results comparison



| GoogLeNet FP32    | GoogLeNet FIXED-8 | SqueezeNet FP32   | SqueezeNet FIXED-8 | VGG16 FP32        | VGG16 FIXED-8      |
|-------------------|-------------------|-------------------|--------------------|-------------------|--------------------|
| Shetland Sheepdog | Shetland Sheepdog | Shetland Sheepdog | Shetland Sheepdog  | Shetland Sheepdog | Shetland Sheepdog  |
| Collie            | Collie            | Collie            | Collie             | Collie            | Collie             |
| Borzoi            | Borzoi            | Border collie     | Papillon           | Borzoi            | Borzoi             |
| Afghan hound      | Pomeranian        | Afghan hound      | Border collie      | Afghan hound      | Papillon           |
| Pomeranian        | Afghan hound      | Papillon          | Pomeranian         | Papillon          | Australian terrier |

- Most differences are in low-priority guesses


### Activation Quantization: 8 Bits Are Enough

Object detection: Results comparison – SqueezeNet + R-FCN



Similar proposal results with lower confidence 









## Aristotle: Architecture for CNN Acceleration




- Based on Zynq 7000 Series FPGA
- Optimized for 3x3 Conv kernels
- Supports different Conv stride sizes
- <u>Scalable</u> design (1PE, 2PE, 4PE, 12PE) on Zynq 7010/7020/7030/7045
- Supports mainstream deep learning object detection frameworks: R-FCN, YOLO, etc.
- Fully pipelined without intermediate data load/store
- Supports dynamic-precision quantization
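One way to read "dynamic-precision quantization" is that the bit width stays fixed while the position of the binary point is chosen separately for each layer or data set. The sketch below assumes that reading and picks the number of fractional bits that minimizes squared quantization error; the search range, function names, and sample values are hypothetical.

```python
def quantize(x, frac_bits, total_bits=8):
    """Signed fixed-point rounding with saturation."""
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = max(qmin, min(qmax, round(x * scale)))
    return q / scale


def choose_frac_bits(values, total_bits=8, candidates=range(-8, 17)):
    """Return the fractional bit count with the lowest squared error on values."""
    def error(frac_bits):
        return sum((v - quantize(v, frac_bits, total_bits)) ** 2 for v in values)
    return min(candidates, key=error)


large_activations = [12.5, -30.2, 7.1]
small_weights = [0.02, -0.11, 0.07]
print(choose_frac_bits(large_activations))
print(choose_frac_bits(small_weights))
```

Data with a large dynamic range ends up with few fractional bits, and data with small magnitudes ends up with many, so both can share the same 8-bit datapath.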

### Aristotle: Processing Element Architecture

Integrate convolvers, adder tree, non-linearity, and pooling units into one PE
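The stages listed for one PE can be mimicked in software to show the data flow: per-channel 3x3 convolution, an adder tree that sums the channel results, a non-linearity, and pooling. This is a functional sketch only; it assumes ReLU and 2x2 max pooling, and it says nothing about how the hardware is actually organized or timed.

```python
def conv3x3(image, kernel):
    """Valid 3x3 convolution (no padding, stride 1) on one channel."""
    h, w = len(image), len(image[0])
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(3) for j in range(3))
             for c in range(w - 2)]
            for r in range(h - 2)]


def adder_tree(maps):
    """Sum the per-channel partial results element by element."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*maps)]


def relu(fmap):
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]


def processing_element(channels, kernels):
    """Chain the four stages without storing intermediate results elsewhere."""
    partial = [conv3x3(ch, k) for ch, k in zip(channels, kernels)]
    return max_pool2x2(relu(adder_tree(partial)))


ramp = [[r * 4 + c for c in range(4)] for r in range(4)]
ones = [[1] * 4 for _ in range(4)]
center = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
box = [[1] * 3 for _ in range(3)]
print(processing_element([ramp, ones], [center, box]))  # [[19]]
```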



### From Model to Instructions

[Figure: flow from model to instructions. Labels: Caffemodel/Prototxt, Parser, Hardware Parameter, Quantized Model, Host CPU, Neural Network Accelerator]





## Descartes: Architecture for Sparse LSTM Acceleration

EIE (Efficient Inference Engine): Extremely efficient, but not for FPGA

- 102 GOPS@600 mW, 800MHz
- LSTM cannot utilize activation sparsity
- Designed by Song Han et al. from Stanford University and published at ISCA 2016

### EIE chip (64PE)

- 10.13 MB SRAM
- 64 Multipliers
- 800MHz

### Xilinx KU060

- 4.75 MB BRAM
- 2760 DSP
- 250-300MHz

### Xilinx KU115

- 9.49MB BRAM
- 5520 DSP
- 250-300MHz

FPGA has significantly more computing units but strictly limited on-chip memory



### **Descartes:** Architecture for Sparse LSTM Acceleration FPGA



- Considers scheduling and non-linear functions in LSTM
- Scalable design (16/32/64 PEs for each thread)
- Two modes: Batch (high throughput) / No Batch (low latency)
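The core operation in a sparse LSTM accelerator is a sparse matrix-vector product spread across several PEs. The sketch below is an illustration under stated assumptions, not the Descartes design: it stores each row as (column, value) pairs and assigns rows to PEs round-robin, then reports how many multiply-accumulates each PE performs.

```python
def to_sparse_rows(matrix):
    """Keep only non-zero entries of each row as (column, value) pairs."""
    return [[(c, v) for c, v in enumerate(row) if v != 0] for row in matrix]


def sparse_matvec(sparse_rows, x, num_pes=2):
    """Compute y = W x, with row r handled by PE number r % num_pes."""
    y = [0.0] * len(sparse_rows)
    macs_per_pe = [0] * num_pes
    for r, row in enumerate(sparse_rows):
        pe = r % num_pes
        for c, v in row:
            y[r] += v * x[c]
            macs_per_pe[pe] += 1
    return y, macs_per_pe


matrix = [
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
    [0, 0, 4, 0],
]
y, macs = sparse_matvec(to_sparse_rows(matrix), [1, 2, 3, 4], num_pes=2)
print(y)     # [4.0, 13.0, 0.0, 12.0]
print(macs)  # [1, 3]
```

The uneven counts in `macs` show why scheduling matters: with an irregular sparsity pattern, some PEs finish early and wait for others unless the work is balanced.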





## **Evaluation: Platform and Benchmark for CNN**

#### Platform Comparison



Nvidia Tegra K1 SoC

- 28 nm
- **ARM Cortex-A15 CPU**
- Kepler GPU 192 Cores
- Caffe with CuDNN

#### Benchmark






Xilinx Zynq 7000 Series

- 28nm
- 85k/125k/350k logic cells (7020/30/45)
- 220/400/900 DSP (7020/30/45)
- 4.9/9.3/19.1Mb BRAM (7020/30/45)



**Customized Network**: Face alignment, 104.6 Mop, 9 Conv layers



### Evaluation: Resource Utilization with Aristotle Architecture



#### Zynq 7020

2 Processing elements, peak performance: 86.4GOPS@150MHz

|       | LUT   | FF     | BRAM | DSP  |
|-------|-------|--------|------|------|
| Total | 53200 | 106400 | 140  | 220  |
| Used  | 27761 | 26600  | 75   | 220  |
| Ratio | 52%   | 22%    | 54%  | 100% |

#### Zynq 7030

4 Processing elements, peak performance: 172.8GOPS@150MHz

|       | LUT   | FF     | BRAM | DSP  |
|-------|-------|--------|------|------|
| Total | 78600 | 157200 | 265  | 400  |
| Used  | 43118 | 34097  | 203  | 400  |
| Ratio | 55%   | 22%    | 77%  | 100% |

#### Zynq 7045

12 Processing elements, peak performance: 518.4GOPS@150MHz

|       | LUT    | FF     | BRAM  | DSP  |
|-------|--------|--------|-------|------|
| Total | 218600 | 437200 | 545   | 900  |
| Used  | 139385 | 85172  | 390.5 | 900  |
| Ratio | 64%    | 19%    | 72%   | 100% |

- Tegra K1 GPU peak performance: 326 GFLOPS
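The three peak figures scale linearly with the number of PEs at a fixed clock. The sketch below reproduces them from a per-PE rate of 288 operations per cycle, a value inferred here by dividing 86.4 GOPS by 2 PEs and 150 MHz; it is not stated in the slides, and the function name is hypothetical.

```python
def peak_gops(num_pes: int, ops_per_cycle_per_pe: int = 288, freq_mhz: int = 150) -> float:
    """Peak throughput in GOPS = PEs x ops per cycle per PE x clock frequency."""
    return num_pes * ops_per_cycle_per_pe * freq_mhz / 1000


for device, pes in [("Zynq 7020", 2), ("Zynq 7030", 4), ("Zynq 7045", 12)]:
    print(device, peak_gops(pes), "GOPS")
```

Under this assumption each PE performs 144 multiply-accumulates per cycle, counting one multiply-accumulate as two operations.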







## Evaluation: Performance of Aristotle Architecture

#### Runtime and performance<sup>\*1</sup> on TK1 and Zynq 7020



- Zynq 7020 consumes 20%-30% of the power of TK1 and costs less than TK1
- 1.78x higher performance on Zynq 7030 compared with Zynq 7020
- 4.94x higher performance on Zynq 7045 compared with Zynq 7020


Aristotle architecture performs better when network is small but has limited peak performance



## **Evaluation: Platform and Benchmark for LSTM**

#### Platform Comparison



### Nvidia K40 GPU

- 28nm
- 2880 CUDA Cores
- 810MHz / 875MHz
- 12GB GDDR5
- Benchmark: Real-world LSTM for Speech Recognition
  - Max matrix size: 4096\*1536
  - Consider scheduling of multiple matrices
  - Consider non-linear functions
  - 100 frames per second



### Kintex Ultrascale Series

- 20nm
- 4.75/9.49MB BRAM (KU060/115)
- 2760/5520 DSP (KU060/115)
- 300MHz





### Evaluation: Performance and Resource Utilization of Descartes Architecture

#### Performance Comparison

| Platform                | GPU K40*1   | FPGA KU060  |
|-------------------------|-------------|-------------|
| Dense or Sparse         | Dense       | Sparse (1   |
| Frequency               | 810/875 MHz | 300 MHz     |
| Precision               | FP32        | FIXED-4     |
| Threads to be Supported | Not limited | 2 (Separate |
| Peak Performance        | 4.29 TFLOPS | 4.8 TOPS*3  |
| Real Power              | 235W        | 30-35W      |




\*3 480 GOPS for dense LSTM

\*4 960 GOPS for dense LSTM
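The peak and power figures in the comparison table can be turned into a rough efficiency estimate. This sketch uses peak numbers only, not measured throughput, and it treats the 4.8 TOPS figure as the dense-equivalent rate of a sparse model; the function name is hypothetical.

```python
def gops_per_watt(gops: float, watts: float) -> float:
    """Peak operations per second per watt, in GOPS/W."""
    return gops / watts


k40 = gops_per_watt(4290, 235)            # 4.29 TFLOPS at 235 W
ku060_worst = gops_per_watt(4800, 35)     # 4.8 TOPS at the upper power bound
ku060_best = gops_per_watt(4800, 30)      # 4.8 TOPS at the lower power bound
sparsity_gain = 4800 / 480                # sparse peak vs. dense peak on KU060

print(round(k40, 1))          # 18.3
print(round(ku060_worst, 1))  # 137.1
print(ku060_best)             # 160.0
print(sparsity_gain)          # 10.0
```

The factor of 10 between the sparse and dense peaks on the KU060 is what separates the two platforms in this estimate; on dense workloads the peak efficiency gap would be much smaller.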

**Resource Utilization**



#### KU060

|       | LUT    | FF     | BRAM | DSP  |
|-------|--------|--------|------|------|
| Total | 331680 | 663360 | 1080 | 2760 |
| Used  | 298875 | 446655 | 1011 | 1505 |
| Ratio | 90%    | 67%    | 94%  | 55%  |

#### KU115

|       | LUT    | FF      | BRAM | DSP  |
|-------|--------|---------|------|------|
| Total | 663360 | 1326720 | 2160 | 5520 |
| Used  | 563403 | 848990  | 1155 | 2529 |
| Ratio | 85%    | 64%     | 54%  | 46%  |




#### DeePhi: Making deployment of deep learning algorithms simple and efficient

### Automatic compilation tool

- Deep compression
- Activation quantization
- Compiler
- Aristotle: Architecture for CNN acceleration
- Descartes: Architecture for sparse LSTM acceleration

Evaluation boards will be shipped in Oct 2016

Apply for test at partner@deephi.tech

New architecture for CNN revealed in Q4 2016

### Conclusion






### Live demo at Poster Session



## Song Yao

Founder & CEO

songyao@deephi.tech

# Thank You!

### About us

- www.deephi.com
- Collaborate with us
- partner@deephi.tech

### Join us

- dream@deephi.tech



