# AMDA

# Applying AMD's "Kaveri" APU for Heterogeneous Computing

DAN BOUVIER, BEN SANDER AUGUST 2014

#### OUR DESIGN CHOICES

#### 



\*AMD TrueAudio technology is offered by select AMD Radeon™ R9 and R7 200 Series GPUs and select AMD A-Series APUs and is designed to improve acoustic realism. Requires enabled game or application. Not all audio equipment supports all audio effects; additional audio equipment may be required for some audio effects. Not all products feature all technologies—check with your component or system manufacturer for specific capabilities.

## A-SERIES REDEFINES COMPUTE

#### 



#### Kaveri

#### MAXIMUM COMPUTE PERFORMANCE

- Up to 12 compute cores\*
  - 4 "Steamroller" CPU cores
  - 8 GCN GPU cores
  - HSA enabled

#### ENHANCED USER EXPERIENCES

- Video acceleration
- AMD TrueAudio technology
- 4 display heads

#### **HIGH PERFORMANCE CONNECTIVITY**

- 128bits DDR3 up to 2133
- PCI-Express<sup>®</sup> Gen3 x16 for discrete graphics upgrade
- PCI-Express<sup>®</sup> for direct attach NVMe SSD

\* For more information visit amd.com/computecores

3 | Applying AMD's "Kaveri" APU for Heterogeneous Computing | HOT CHIPS 26 - AUGUST 2014 |

"KAVERI"





#### **Die Size:** 245mm<sup>2</sup>

#### Transistor count: 2.41 Billion

#### Process: 28nm

#### IMPROVEMENTS FOR AMD'S DUAL CPU COMPUTE CORES



## "KAVERI" GPU – GRAPHICS CORE NEXT ARCHITECTURE





**47%** of "Kaveri" is dedicated for GPU

- 8 compute units
   (512 IEEE 2008-compliant shaders)
- Device flat (generic) addressing support

- Masked Quad Sum of Absolute Difference (MQSAD) with 32b accumulation and saturation
- Precision improvement for native LOG/EXP ops to 1ULP



6 | Applying AMD's "Kaveri" APU for Heterogeneous Computing | HOT CHIPS 26 - AUGUST 2014 |

# "KAVERI" APU ENHANCEMENTS





7 | Applying AMD's "Kaveri" APU for Heterogeneous Computing | HOT CHIPS 26 - AUGUST 2014 |

#### INTRODUCTION OF HARDWARE COHERENCY FOR THE GPU

- Coherent Hub (CHUB)
  - Compute traffic steered to dedicated coherent transaction path
  - Includes IOMMU Address Translation Cache (ATC)
  - Selectively probe CPU caches based on page attribute

#### Atomics

- Single cycle request
- One at a time (no gathering at the SYS level)
- All atomics return the original data from DRAM (success of conditional)
- TYPES Supported:
  - Test and OR, Swap, Add, Subtract, AND, OR, XOR, Signed Min, Signed Max, Unsigned Min, Unsigned Max, Clamping Inc, Clamping Dec, Compare and Swap



#### 

#### IOMMUv2 – ELIMINATES DOUBLE COPY

#### With traditional memory system

- Not all GPU memory is CPU accessible (e.g. local frame buffer memory)
- Local frame buffer may not be large enough working space
- Lack of demand-paging support
- Alignment limitations
- Data copied from unpinned to pinned regions

#### With IOMMUv2

- Eliminate CPU & DMA copy operations (in both directions!!)
- GPU operates on unpinned region directly





#### IOMMUV2 PERIPHERAL PAGE FAULTS





# QUEUING (TODAY'S PICTURE)

▲ High overhead to pass work to/from GPU





#### ▲ Shared Virtual Memory



- ▲ Shared Virtual Memory
- System Coherency



- Shared Virtual Memory
- System Coherency
- ▲ Signaling



- Shared Virtual Memory
- System Coherency
- ▲ Signaling
- User Mode Queuing



- Shared Virtual Memory
- System Coherency
- Signaling
- User Mode Queuing



#### HSA IN A NUTSHELL

#### AMD

#### HSA Hardware Building Blocks

#### Shared Virtual Memory

- Single address space
- Coherent
- Pageable
- Fast access from all components
- Can share pointers
- Architected User-Level Queues

#### Signals

*Provide industry-standard, architected requirements for how devices share memory and communicate with each other* 

#### HSA Software Building Blocks

#### HSAIL

- Portable, parallel, compiler IR
- 🖌 HSA Runtime
  - Create queues
  - Allocate memory
  - Device discovery
- Reference High-level Compiler
  - CLANG/LLVM
  - Generate HSAIL
  - OpenCL, C++AMP

*Provide industry-standard compiler IR and runtime to enable existing programming languages to target the GPU* 

#### **EVOLUTION OF THE SOFTWARE STACK**



#### WHAT IS HSAIL?

- ▲ Intermediate language for parallel compute in HSA
- Generated by a "High-level Compiler" (GCC, LLVM, Java VM, etc)
- Expresses parallel regions of code
- Portable across vendors and stable across product generations
- ▲ Goal: Bring parallel acceleration to mainstream programming languages (OpenMP, C++AMP, Java)



#### PROGRAMMING LANGUAGES PROLIFERATING ON HSA

#### 



#### HSA ENABLEMENT OF JAVA

#### JAVA 9 – HSA-ENABLED JAVA (SUMATRA)

- Adds native APU acceleration to Java Virtual Machine (JVM)
- Developer uses Lambda, Stream API
- ▲ JVM generates HSAIL automatically



## USE CASES SHOWING HSA ADVANTAGE ON KAVERI

| Programming<br>Technique | Use Case Description                                                                                                              | HSA Advantage                                                                                                                                                   |
|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Data Pointers            | <b>Binary tree searches</b><br>GPU performs searches in a CPU-created binary tree                                                 | GPU can access existing data structures containing pointers<br>Higher performance through parallel operations                                                   |
| Platform Atomics         | <b>Binary tree updates</b><br>CPU and GPU operating simultaneously on the tree, both<br>doing modifications                       | CPU and GPU can synchronize using Platform Atomics<br>Higher performance through parallel operations                                                            |
| Large Data Sets          | <b>Hierarchical data searches</b><br>Applications include object recognition, collision detection,<br>global illumination, BVH    | GPU can operate on huge models in place<br>Higher performance through parallel operations                                                                       |
| CPU Callbacks            | <b>Middleware user-callbacks</b><br>GPU processes work items, some of which require a call to<br>a CPU function to fetch new data | GPU can invoke CPU functions from within a GPU kernel<br>Simpler programming does not require "split kernels"<br>Higher performance through parallel operations |

# Data Pointers

#### DATA POINTERS

Legacy





#### DATA POINTERS

Legacy

| SYSTEM MEMORY |                  |
|---------------|------------------|
| L             |                  |
|               |                  |
|               |                  |
| TREE          | RESULT<br>BUFFER |



#### DATA POINTERS

Legacy



| GPU        |      |               |  |  |
|------------|------|---------------|--|--|
| KERNEL     |      |               |  |  |
| GPU MEMORY |      |               |  |  |
|            |      |               |  |  |
| FLAT       | TREE | RESULT BUFFER |  |  |





#### DATA POINTERS - CODE COMPLEXITY



static void run\_ocl\_path()

/\* Allocation and initialization \*/ tree = (node \*) malloc(num nodes \* sizeof(node)); initialize nodes(tree, num nodes); root = construct BST(num nodes, tree);

search\_keys = (int \*) malloc(num\_search\_keys \* sizeof(int)); initialize\_search\_keys(search\_keys, num\_search\_keys, sort\_input); AMD

found\_keys = (int \*) malloc(num\_search\_keys \* sizeof(int)); memset(found\_keys, 0, num\_search\_keys \* sizeof(int));

ocl tree = (ocl node \*) malloc(num nodes \* sizeof(ocl node));

cl\_mem\_cl\_search\_keys = clCreateBuffer(context, CL\_MEM\_READ\_ONLY,

num search keys \* sizeof(int), NULL, &status); cl mem cl found nodes id = clCreateBuffer(context, CL MEM WRITE ONLY, num\_search\_keys \* sizeof(int), NULL, &status);

/\* The tree is converted to its array form \*/ int root id: initialize ocl nodes(ocl tree, num nodes); convert\_tree\_to\_array(root, ocl\_tree, &root\_id);

/\* Copy the tree and search keys array to the GPU \*/ clEnqueueWriteBuffer(queue, cl ocl tree, CL TRUE, 0, num\_nodes \* sizeof(ocl\_node), ocl\_tree, 0, NULL, NULL);

clEngueueWriteBuffer(gueue, cl search keys, CL TRUE, 0, num\_search\_keys \* sizeof(int), search\_keys, 0, NULL, NULL);

#### /\* GPU work enqueue \*/

clSetKernelArg(search\_kernel, 0, sizeof(cl\_ocl\_tree), &cl\_ocl\_tree); clSetKernelArg(search\_kernel, 1, sizeof(cl\_int), &root\_id); clSetKernelArg(search\_kernel, 2, sizeof(cl\_search\_keys), &cl\_search\_keys); clSetKernelArg(search kernel, 3, sizeof(cl int), &num search keys); clSetKernelArg(search\_kernel, 4, sizeof(cl\_found\_nodes\_id), &cl\_found\_nodes\_id);

clEnqueueNDRangeKernel(queue, search\_kernel, 1, NULL, &num\_search\_keys, &preferredLocalSize, 0, NULL, NULL);

#### clFinish(queue);

/\* Copy the results back from the GPU \*/ clEnqueueReadBuffer(queue, cl\_found\_nodes\_id, CL\_TRUE, 0, num\_search\_keys \* sizeof(int), found\_keys, 0, NULL, NULL);

/\* Cleanup \*/ free(ocl\_tree); free(tree); free(found keys); free(search keys);

clReleaseMemObject(cl\_ocl\_tree); clReleaseMemObject(cl\_search\_keys); clReleaseMemObject(cl\_found\_nodes\_id);

static void initialize ocl nodes(ocl node \*ocl tree, long long int num nodes)

for (int i = 0; i < num\_nodes; i++) {</pre> ocl tree[i].left = -1: ocl\_tree[i].right = -1;

static void convert tree to array(node \*root, ocl node \*ocl tree, int \*root id)

node \*\*tree\_queue; node \*tmp;

tree\_queue = (node \*\*)calloc(num\_nodes, sizeof(node \*));

long long int front = 0; long long int rear = 0;

tree gueue[rear] = root: ocl\_tree[rear].value = root->value; rear++:

\*root\_id = 0;

while (front != rear) { tmp = tree\_queue[front];
if (!tmp) break

> if (tmp->left) { tree\_queue[rear] = tmp->left; ocl\_tree[rear].value = tmp->left->value; ocl\_tree[front].left = (int)rear; rear++;

> if (tmp->right) { tree\_queue[rear] = tmp->right; ocl tree[rear].value = tmp->right->value; ocl\_tree[front].right = (int)rear; rear++;

front++;

if (tree\_queue) free(tree\_queue);

#### DATA POINTERS - PERFORMANCE



**Binary Tree Search** 60,000 50,000 40,000 **v 4**0,000 **v 4** 🞽 CPU (1 core) 📔 CPU (4 core) **Search rate** 500000 📕 Legacy APU 📕 HSA APU 10,000 0 1M 5M 10M 25M Tree size (# nodes)

\*Measured in AMD labs<sup>2</sup>

#### 30 | Applying AMD's "Kaveri" APU for Heterogeneous Computing | HOT CHIPS 26 - AUGUST 2014 |

# Platform

Atomics \_

#### PLATFORM ATOMICS

## HSA and full OpenCL 2.0

Both CPU+GPU operating on same data structure **concurrently** 



INPUT BUFFER



#### PLATFORM ATOMICS

#### HSA and full OpenCL 2.0



#### AMD'S UNIFIED SDK

- ▲ Access to AMD APU and GPU programmable components
- Component installer choose just what you need
- Initial release includes:
- APP SDK v2.9
- Media SDK 1.0



#### AMD Unified SDK

| APP SDK 2.9                                                                                                                                                                                                                                                                                                                                          | MEDIA SDK 1.0                                                                                                                                                                                                                                  |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>✓ Web-based sample browser</li> <li>✓ Supports programming standards: OpenCL<sup>™</sup>, C++ AMP</li> <li>✓ Code samples for accelerated open source libraries:         <ul> <li>OpenCV, OpenNI, Bolt, Aparapi</li> </ul> </li> <li>✓ OpenCL<sup>™</sup> source editing plug-in for visual studio</li> <li>✓ Now supports Cmake</li> </ul> | <ul> <li>GPU accelerated video pre/post processing library</li> <li>Leverage AMD's media encode/decode acceleration blocks</li> <li>Library for low latency video encoding</li> <li>Supports both Windows Store and classic desktop</li> </ul> |

34 | Applying AMD's "Kaveri" APU for Heterogeneous Computing | HOT CHIPS 26 - AUGUST 2014 |

#### ACCELERATED OPEN SOURCE LIBRARIES

| OpenCV                                                                                                                       | Bolt                                                                                                                                                                                                  | clMath                                                                                                                                                         | Aparapi                                                                                                                         |
|------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>▲ Most popular computer vision library</li> <li>▲ Now with many OpenCL<sup>™</sup> accelerated functions</li> </ul> | <ul> <li>C++ template library</li> <li>Provides GPU off-load for<br/>common data-parallel<br/>algorithms</li> <li>Now with cross-OS support<br/>and improved<br/>performance/functionality</li> </ul> | <ul> <li>AMD released APPML as open source to create clMath</li> <li>Accelerated BLAS and FFT libraries</li> <li>Accessible from Fortran, C and C++</li> </ul> | <ul> <li>OpenCL accelerated Java 7</li> <li>Java APIs for data parallel<br/>algorithms (no need to<br/>learn OpenCL)</li> </ul> |

#### **KAVERI OPENS THE GATES TO PERFORMANCE**

#### 



visibility into entire memory space

APU compute power

flexibility to be used to create and dispatch work items

#### END RESULT: HSA RESULTS IN MORE ENERGY EFFICIENT COMPUTATION



#### Many important workloads execute many times more efficiently using GPU compute

What does this mean for power?

- resources than CPU only – Egyideo indexing natural human
- E.g. video indexing, natural human interfaces, pattern recognition
- ▲ For the same power, much better performance → finish early (computation, web page render, display update) and go to sleep

#### AMD POWER EFFICIENCY IMPROVEMENTS

#### Power Reductions Over Time

#### Typical-Use Energy Efficiency

- Over the last 6 years (2008-2014), AMD achieved a 10x improvement in platform energy efficiency\*
- Enabled by:
  - Intelligent dynamic power management
  - Further integration of system components (NorthBridge, GPU, SouthBridge)
  - Silicon power optimizations

Energy use drops

– Process scaling improvements



#### While performance increases

= Increased efficiency

\*Typical-use Energy Efficiency as defined by taking the ratio of compute capability as measured by common performance measures such as SpecIntRate, PassMark and PCMark, divided by typical energy use as defined by  $E_{TEC}$  (Typical Energy Consumption for notebook computers) as specified in Energy Star Program Requirements Rev 6.0 10/2013

#### HSA A KEY ENABLER FOR AN ENERGY EFFICIENT ROADMAP





# SUMMARY

With HSA features, Kaveri is an optimized platform for Heterogeneous Computing

- HSA features make Kaveri the FIRST full OpenCL 2.0 capable chip
  - Fine Grained SVM (hUMA)
  - C11 Atomics (Platform Atomics)
  - Dynamic Parallelism (hQ)
  - Pipes (hUMA, hQ)



<sup>1</sup>Testing by AMD Performance labs using 3DMark Sky Diver as of July 3, 2014. AMD A10-7850K with 2x8GB DDR3-2133 memory, 512GB SSD, Windows 8.1, Driver 14.20.1004 Beta 11 scored 5523. AMD A10-6800K with 2x8GB DDR3-2133 memory, 512GB SSD, Windows 8.1, Driver 14.20.1004 Beta 11 scored 3796.

<sup>2</sup> System configuration "Kaveri" A10- 95W TDP, CPU Speed 3.7GHz/4.0 GHz, GPU speed 720MHz, Memory 2x4GB DDR3-1600, Disk HDD, Video Driver 13.35/HAS Beta 2.2, test dates January 1-3, 2014, Microsoft Windows 8.1 (64-bit)

#### **DISCLAIMER & ATTRIBUTION**

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

#### **ATTRIBUTION**

© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. PCI Express is a registered trademark of PCI-SIG Corporation. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.

42 | Applying AMD's "Kaveri" APU for Heterogeneous Computing | HOT CHIPS 26 - AUGUST 2014 |