

# KNIGHTS MILL: NEW INTEL PROCESSOR FOR MACHINE LEARNING

Dennis Bradford, Sundaram Chinthamani, Jesus Corbal, Adhiraj Hassan, Ken Janik, Nawab Ali

## Legal Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at [intel.com].

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

§ Configurations: see each performance slide notes for configurations.

§ For more information go to http://www.intel.com/performance.

Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjbb, SPECompG, SPEC MPI, and SPECjEnterprise\* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information.

The cost reduction scenarios described in this document are intended to enable you to get a better understanding of how the purchase of a given Intel product, combined with a number of situation-specific variables, might affect your future cost and savings. Nothing in this document should be interpreted as either a promise of or contract for a given level of costs.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Intel, the Intel logo, Intel Xeon, Xeon logo, Intel Xeon Phi logo, and the Look Inside. logo are trademarks of Intel Corporation in the U.S. and/or other countries.

\*Other names and brands may be claimed as the property of others.



### What is Machine Learning?

#### **CLASSIC ML**

Using functions or algorithms to extract insights from new data



**DEEP LEARNING** 

Using massive data sets to train deep (neural) graphs that can extract insights from new data



\*Not all classical machine learning algorithms require separate training and new data sets



# **DATACENTER**

#### **ALL PURPOSE**



Intel<sup>®</sup> Xeon<sup>®</sup> Processor Family

### **MOST AGILE AI PLATFORM**

Scalable performance for widest variety of AI & other datacenter workloads – including deep learning training & inference

### HIGHLY-PARALLEL



Intel® Xeon Phi<sup>™</sup> Processor (Knights Mill†)

### **FASTER DL TRAINING**

Scalable performance optimized for even faster deep learning training and select highly-parallel datacenter workloads\*



**FLEXIBLE ACCELERATION** 

Intel® FPGA

#### **ENHANCED DL INFERENCE**

Scalable acceleration for deep learning inference in real-time with higher efficiency, and wide range of workloads & configurations







Crest Family<sup>†</sup>

### **DEEP LEARNING BY DESIGN**

Scalable acceleration with best performance for intensive deep learning training & inference

<sup>†</sup>Codename for product that is coming soon

All performance positioning claims are relative to other processor technologies in Intel's AI datacenter portfolio

\*Knights Mill (KNM); select = single-precision highly-parallel workloads that generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth, e.g. energy (reverse time migration), deep learning training, etc. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.





### What is Knights Mill?

- First Intel product targeted specifically at Deep Learning training workloads
  - Up to 4x DL Peak performance over Xeon Phi<sup>™</sup> 7200 Series<sup>1</sup>
- Built on top of 2<sup>nd</sup> generation Intel<sup>®</sup> Xeon Phi<sup>™</sup> processor
  - Improved efficiency
  - Optimized for scale-out
  - Enhanced variable precision
  - Flexible, high capacity memory





### Knights Mill – New Intel Processor for Deep Learning

Designed for Deep Learning – "AI on IA"





### Knights Mill exploits all <del>3</del> 4 levels of parallelism





### New Deep Learning ISA: Quad FMA FP32

| Mnemonic  | Format                  | Description                                        |
|-----------|-------------------------|----------------------------------------------------|
| V4FMADDPS | zmm1 {k1}, zmm2+3, m128 | Quadruple packed single-precision multiply and add |

- `zmm2+3`: "source block" of 4 zmm sources
- `m128`: memory operand packing 4 scalars (4x32)

#### V4FMADDPS zmm4 {k1}, zmm0+3, m128

```
for i = 0..15
    zmm4.fp32[i] = zmm4.fp32[i]
                 + zmm0.fp32[i] * m128.fp32[0]
                 + zmm1.fp32[i] * m128.fp32[1]
                 + zmm2.fp32[i] * m128.fp32[2]
                 + zmm3.fp32[i] * m128.fp32[3]
```
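The pseudocode above can be emulated in a few lines of Python. This is a minimal sketch, not a bit-exact model of the hardware: the function name `v4fmaddps`, the helper `to_fp32`, and the choice to round to fp32 after each of the four multiply-add steps are assumptions made for illustration, and the `{k1}` write mask is ignored, as it is in the slide's pseudocode.

```python
import struct

LANES = 16  # a 512-bit zmm register holds 16 fp32 lanes


def to_fp32(x):
    """Round a Python float to the nearest fp32 value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def v4fmaddps(dst, src_block, mem):
    """Emulate the slide's V4FMADDPS pseudocode without a write mask.

    dst       -- 16 floats, the accumulator (zmm4 on the slide)
    src_block -- 4 lists of 16 floats, the "source block" (zmm0..zmm3)
    mem       -- 4 floats, the m128 memory operand
    """
    assert len(dst) == LANES and len(src_block) == 4 and len(mem) == 4
    out = list(dst)
    for i in range(LANES):
        acc = out[i]
        for j in range(4):
            # Assumption: one fp32 rounding per multiply-add step.
            acc = to_fp32(acc + src_block[j][i] * mem[j])
        out[i] = acc
    return out
```

Each lane accumulates four products, so one instruction performs 64 multiply-adds while reading only four scalars from memory.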



### An Example: Using Quad FMA on Matrix Multiply
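The figure for this slide did not survive extraction, so the following is only a plausible sketch of how a quad FMA maps onto matrix multiply, not a reconstruction of the slide. It assumes that four consecutive rows of B form the "source block" and that four consecutive scalars from one row of A form the memory operand. The names `quad_fma` and `matmul_quad_fma` are hypothetical.

```python
def quad_fma(acc, rows, scalars):
    """acc[i] += sum over j of rows[j][i] * scalars[j], for j = 0..3."""
    return [acc[i] + sum(rows[j][i] * scalars[j] for j in range(4))
            for i in range(len(acc))]


def matmul_quad_fma(a, b):
    """Compute C = A x B for A (M x K) and B (K x 16), with K a multiple of 4."""
    m, k = len(a), len(b)
    assert k % 4 == 0 and all(len(row) == k for row in a)
    assert all(len(row) == 16 for row in b)
    c = []
    for i in range(m):
        acc = [0.0] * 16  # one 16-lane accumulator per row of C
        for k0 in range(0, k, 4):
            # Source block: 4 rows of B. Memory operand: 4 scalars of A.
            acc = quad_fma(acc, b[k0:k0 + 4], a[i][k0:k0 + 4])
        c.append(acc)
    return c
```

In this arrangement the inner loop issues K/4 quad FMAs per row of C instead of K single FMAs.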





### Variable Precision: What is VNNI-16?

- Vector Neural Network Instructions
- Variable precision
  - Inputs: 16-bit INT
  - Outputs: 32-bit INT
- Variable precision is the best of both worlds
  - Same operations/instruction as 'half precision'
    - 2x OPS vs Single Precision
  - Similar output precision for optimal training convergence
    - 31 bits of INT32 vs 24 bits of mantissa in FP32
  - The obvious trade-off is the software overhead of handling dynamic range (fixed precision)
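To make the dynamic-range trade-off concrete, here is a minimal sketch of a fixed-point dot product with 16-bit integer inputs and a 32-bit integer accumulator. The power-of-two scaling scheme and the names `quantize_int16`, `dot_int16_int32`, and `fixed_point_dot` are assumptions for illustration; the slides do not say how software should choose or track scale factors.

```python
def quantize_int16(values, frac_bits):
    """Map floats to int16 fixed point with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    q = [int(round(v * scale)) for v in values]
    # Software must pick frac_bits so that every value fits in 16 bits.
    assert all(-32768 <= x <= 32767 for x in q), "value overflows int16"
    return q


def dot_int16_int32(qa, qb):
    """Dot product of int16 inputs, checked against the int32 range."""
    acc = 0
    for x, y in zip(qa, qb):
        acc += x * y
        assert -2**31 <= acc < 2**31, "accumulator overflows int32"
    return acc


def fixed_point_dot(a, b, frac_bits_a, frac_bits_b):
    """Quantize, accumulate in integers, then rescale back to a float."""
    qa = quantize_int16(a, frac_bits_a)
    qb = quantize_int16(b, frac_bits_b)
    acc = dot_int16_int32(qa, qb)
    return acc / float(1 << (frac_bits_a + frac_bits_b))
```

The two range checks are the software overhead the last bullet refers to: the hardware does not track exponents, so the caller has to.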





### QVNNI = QFMA + VNNI

| Instruction | Format | Description |
|------------|------------------------------|----------------------------------------------------------------|
| VP4DPWSSD  | zmm1 {k1}, zmm2+3, mem128 | Quadruple INT16 to INT32 horizontal MAC |
| VP4DPWSSDS | zmm1 {k1}, zmm2+3, mem128 | Quadruple INT16 to INT32 horizontal MAC with signed saturation |

#### Example

- VP4DPWSSD zmm4 {k1}, zmm0+3, m128
  - for i=0..15
    - zmm4.int32[i] = zmm4.int32[i]
      - \+ (zmm0.int16[2\*i]\*m128.int16[0] + zmm0.int16[2\*i+1]\*m128.int16[1])
      - \+ (zmm1.int16[2\*i]\*m128.int16[2] + zmm1.int16[2\*i+1]\*m128.int16[3])
      - \+ (zmm2.int16[2\*i]\*m128.int16[4] + zmm2.int16[2\*i+1]\*m128.int16[5])
      - \+ (zmm3.int16[2\*i]\*m128.int16[6] + zmm3.int16[2\*i+1]\*m128.int16[7])
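The example above can be emulated in Python as follows. This is a sketch under stated assumptions: the function name `vp4dpwssd` and its `saturate` flag are hypothetical, the `{k1}` write mask is ignored, the non-saturating form is assumed to wrap around on int32 overflow, and the saturating form (VP4DPWSSDS) is assumed to clamp after each of the four accumulation steps.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def wrap_int32(x):
    """Two's-complement wraparound to the int32 range."""
    return (x + 2**31) % 2**32 - 2**31


def saturate_int32(x):
    """Clamp to the int32 range."""
    return max(INT32_MIN, min(INT32_MAX, x))


def vp4dpwssd(dst, src_block, mem, saturate=False):
    """Emulate the slide's VP4DPWSSD example without a write mask.

    dst       -- 16 int32 accumulators (zmm4 on the slide)
    src_block -- 4 lists of 32 int16 values (zmm0..zmm3)
    mem       -- 8 int16 values, the m128 memory operand
    """
    assert len(dst) == 16 and len(src_block) == 4 and len(mem) == 8
    assert all(len(reg) == 32 for reg in src_block)
    fix = saturate_int32 if saturate else wrap_int32
    out = list(dst)
    for i in range(16):
        acc = out[i]
        for j in range(4):
            # Horizontal step: two adjacent int16 products feed one int32 lane.
            pair = (src_block[j][2 * i] * mem[2 * j]
                    + src_block[j][2 * i + 1] * mem[2 * j + 1])
            acc = fix(acc + pair)
        out[i] = acc
    return out
```

Each int32 lane receives eight int16 products per instruction, twice the four products per lane of the FP32 quad FMA.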



### **Knights Mill Core**

**ISA**: SSE, AVX, AVX512-F

Double Precision stack

- 1 VPU port/core (512b)

Single Precision/VNNI stack

- 2 stacked FMAs per port





### Intel<sup>®</sup> Xeon Phi<sup>™</sup> 7200 Series vs. Knights Mill: Port Comparisons




### Quad FMA Double-pumped Execution (\*)

(\*) Included for illustration purposes, not intended as an exact recreation of KNM pipeline stages

#### The life of a Single-precision FMA instruction in Knights Mill

| FETCH | DEC | RAT | SCHED | AGU | MEM   | WB |      |      |      |    |
|-------|-----|-----|-------|-----|-------|----|------|------|------|----|
|       |     |     |       |     | SCHED | RF | EXE0 | EXE1 | EXE2 | WB |

#### The life of a Single-precision QFMA instruction in Knights Mill



### Efficiency of double-pumped execution




### Knights Mill: Putting It All Together

#### Knights Mill is a Xeon Phi™ 7200 Series Derivative

- Xeon Phi<sup>™</sup> 7200 series & Knights Mill share the same compute architecture
- Built for different markets
  - Xeon Phi<sup>™</sup> 7200 series → HPC workloads
  - Knights Mill → deep learning training workloads

Xeon Phi<sup>™</sup> 7200 series & Knights Mill are the <u>same</u> generation of Intel® Xeon Phi™ products

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel® measured as of February 2017

#### Adding New Instruction Sets

- Knights Mill uses new instructions to adjust performance
- Compared to Xeon Phi<sup>™</sup> 7200 series:
  - 2x single precision
  - 1/2 double precision
  - 4x using new QVNNI
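One way to read these ratios is that the 4x figure composes the 2x single-precision gain with the "2x OPS vs Single Precision" of VNNI-16 stated earlier. The sketch below spells out that arithmetic. The multiplicative composition is an assumption, not something the slides state explicitly, and the variable names are hypothetical. The results are peak ratios relative to a Xeon Phi™ 7200 series baseline of 1.0, not measured performance.

```python
BASELINE = 1.0  # Xeon Phi 7200 series peak, normalized

sp_stack_gain = 2.0   # "2x single precision"
dp_stack_gain = 0.5   # "1/2 double precision"
vnni_ops_gain = 2.0   # "2x OPS vs Single Precision" for 16-bit inputs

relative_peak = {
    "single precision": BASELINE * sp_stack_gain,
    "double precision": BASELINE * dp_stack_gain,
    # Assumption: QVNNI runs on the single-precision stack, so gains multiply.
    "QVNNI": BASELINE * sp_stack_gain * vnni_ops_gain,
}

for name, ratio in relative_peak.items():
    print(f"{name}: {ratio}x")
```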

Up to 4x\* performance over Knights Landing for Deep Learning workloads via QVNNI

#### Deep Learning Software Optimizations

- Intel is optimizing libraries & frameworks used for deep learning training
- Investments apply to Intel<sup>®</sup> Xeon<sup>®</sup> and Xeon Phi<sup>™</sup> processors, & FPGAs

S/W optimizations give up to 400x performance over non-optimized Intel products\*\*



# **INTEL AI PORTFOLIO**

## **AI FRAMEWORKS**

#### **SELECT YOUR FAVORITE AI FRAMEWORK**








## **INTEL LIBRARIES, FRAMEWORKS & TOOLS**

| | Intel® Math Kernel Library | Intel® MKL-DNN | Intel® MLSL | Intel® Data<br>Analytics<br>Acceleration<br>Library<br>(DAAL) | Intel®<br>Distribution<br>for Python | Open Source<br>Frameworks | Intel Deep<br>Learning SDK | Intel® Computer<br>Vision SDK |
|---------------------------|---------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| High<br>Level<br>Overview | High performance<br>math primitives<br>granting low level<br>of control | Free open source<br>DNN functions for<br>high-velocity<br>integration with<br>deep learning<br>frameworks | Primitive<br>communication<br>building blocks to<br>scale deep learning<br>framework<br>performance over a<br>cluster | Broad data analytics<br>acceleration object<br>oriented library<br>supporting<br>distributed ML at the<br>algorithm level | Most popular and<br>fastest growing<br>language for<br>machine learning | Toolkits driven by<br>academia and<br>industry for<br>training machine<br>learning algorithms | Accelerate deep<br>learning model<br>design, training and<br>deployment | Toolkit to develop &<br>deploy vision-<br>oriented solutions<br>that harness the full<br>performance of Intel<br>CPUs and SOC<br>accelerators |
| Primary<br>Audience       | Consumed by<br>developers of<br>higher level<br>libraries and<br>Applications         | Consumed by<br>developers of the<br>next generation of<br>deep learning<br>frameworks                     | Deep learning<br>framework<br>developers and<br>optimizers                                                                    | Wider Data Analytics<br>and ML audience,<br>Algorithm level<br>development for all<br>stages of data<br>analytics         | Application<br>Developers and<br>Data Scientists                            | Machine Learning<br>App Developers,<br>Researchers and<br>Data Scientists.                    | Application<br>Developers and Data<br>Scientists                                                                        | Developers who<br>create vision-<br>oriented solutions                                                                                           |
| Example<br>Usage | Framework<br>developers call<br>matrix<br>multiplication,<br>convolution<br>functions | New framework<br>with functions<br>developers call for<br>max CPU<br>performance | Framework<br>developer calls<br>functions to<br>distribute Caffe<br>training compute<br>across an Intel®<br>Xeon Phi™ cluster | Call distributed<br>alternating least<br>squares algorithm<br>for a<br>recommendation<br>system | Call scikit-learn<br>k-means function<br>for credit card<br>fraud detection | Script and train a<br>convolutional neural<br>network for image<br>recognition | Deep Learning<br>training and model<br>creation, with<br>optimization for<br>deployment on<br>constrained end<br>device | Use deep learning to<br>do pedestrian<br>detection |

Find out more at software.intel.com/ai



### Call to Action: Visit Intel Websites for more info

#### Intel.com/AI

- Stories and Use Cases
- AI Academy Access
- AI Product Overviews



#### Intelnervana.com

- Nervana Platform Info
- Optimization resources
- Nervana Graph
- Events & Partners





