# Trade-off Considerations and Performance of Intel's MMX<sup>™</sup> Technology

Uri Weiser
Intel Corporation
Israel Design Center

August 20, 1996



## **Agenda**

- The Opportunity
- Definition Consideration
- MMX™ Technology
- Performance/Example
- Conclusions

## **The Opportunity**

- Emergence of new applications
  - Multimedia
  - Communication
- The need for performance
- Utilization of existing hardware
  - Datapath
  - Registers
  - Internal buses



### **Evolution of the PC**

- The trend: Multimedia and communications applications are driving the market
  - Video & audio compression (DVD, MPEG2, AC3), games (3D graphics), speech recognition, voice compression, image processing, video conferencing (POTS and LAN)

The Home PC

Same as office

The Multimedia PC

- Audio
- CD-ROM
- Graphics accelerator
- Modem





## Characteristics of Multimedia and Communication Apps/Algorithms

- Multimedia and communication applications built from basic algorithms "glued" together
- Large degree of commonality across the diverse algorithms
  - Computation intensive
  - Data streaming
  - Small data types



Potential data parallelism

### **Definition Consideration**

- Full compatibility with existing Intel architecture software model
- Significant performance benefit
- General and flexible/not specific
- Minimal die size impact and design complexity

## Compatibility

#### Requirements:

- Map into existing Intel Architecture
  - No new machine state
  - No new events
  - Availability of unused Op Code space

#### Approach:

- Use the FP register structure (80/64-bit, vs. 32-bit integer registers)
- No new exceptions
- Alias onto the existing OS mechanisms for handling FP state



## **Generality/Extensibility**

- Define atomic operations
  - Arithmetic: add, sub, shift, mul, compare
  - Logic: and, andnot, or, xor
  - Conversion
  - Exception: Muladd
- Straightforward migration into Intel's future processors

## Intel's MMX<sup>TM</sup> Technology

- 57 new instructions
  - Single Instruction, Multiple Data (SIMD) architecture technique
  - Fixed point integer
- Map into 8 FP registers/direct access
- No new exceptions
- Low implementation complexity

The Most Significant Enhancement to Intel Architecture Since the i386™ Processor


## **Data Types**

- Packed bytes
  - Mainly for graphics and video
- Packed words
  - Mainly for audio and communications
- Packed doublewords
  - General purpose use
- Quadword
  - Bitwise operations and data alignment
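
The four data types above are four views of the same 64 bits. A minimal C sketch of that idea follows; the union and its member names are illustrative assumptions, not part of any Intel header.

```c
#include <stdint.h>

/* Hypothetical model of one 64-bit MMX operand viewed as each data type. */
typedef union {
    uint8_t  b[8];  /* packed bytes:       8 x 8 bits  */
    uint16_t w[4];  /* packed words:       4 x 16 bits */
    uint32_t d[2];  /* packed doublewords: 2 x 32 bits */
    uint64_t q;     /* quadword:           1 x 64 bits */
} mmx_operand;

/* Number of elements processed per instruction for a given element size. */
static int lanes_for_element_bits(int element_bits)
{
    return 64 / element_bits;
}
```

With this model, `lanes_for_element_bits(8)` gives 8 and `lanes_for_element_bits(16)` gives 4, which is where the per-instruction parallelism comes from.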





## MMX<sup>TM</sup> Architecture Summary



**Packed Multimedia data** 



A Compatible Extension Architecture

## Sample MMX<sup>TM</sup> Technology Operations

#### **Saturating Arithmetic**

| a3    | a2    | a1    | FFFFh |
|-------|-------|-------|-------|
| +     | +     | +     | +     |
| b3    | b2    | b1    | 8000h |
| a3+b3 | a2+b2 | a1+b1 | FFFFh |
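
The rightmost column of the table (FFFFh + 8000h = FFFFh) is unsigned saturation: the sum clamps at the largest representable value instead of wrapping around. The scalar C model below is a sketch of that behavior for four 16-bit lanes; the type and function names are assumptions made for illustration.

```c
#include <stdint.h>

/* Four 16-bit lanes, standing in for one packed-word operand. */
typedef struct { uint16_t w[4]; } packed_u16;

/* Unsigned saturating add of one lane: clamp to FFFFh instead of wrapping. */
static uint16_t add_sat_u16(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    return sum > 0xFFFFu ? (uint16_t)0xFFFFu : (uint16_t)sum;
}

/* Apply the saturating add to all four lanes independently. */
static packed_u16 packed_add_sat_u16(packed_u16 a, packed_u16 b)
{
    packed_u16 r;
    for (int i = 0; i < 4; i++)
        r.w[i] = add_sat_u16(a.w[i], b.w[i]);
    return r;
}
```

A wraparound add would turn FFFFh + 8000h into 7FFFh, which for pixel or audio data shows up as a visible or audible glitch; clamping avoids that without a per-element branch.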

#### **Parallel Compares**

| 23    | 45    | 16    | 34    |
|-------|-------|-------|-------|
| gt?   | gt?   | gt?   | gt?   |
| 31    | 7     | 16    | 67    |
| 0000h | FFFFh | 0000h | 0000h |

#### 16b x 16b => 32b Multiply Add



#### **Data Conversion**





## What Is A Parallel Compare?

- No flags to store multiple results
- Result is a mask
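
As a sketch of what "result is a mask" means, the C model below performs a signed greater-than compare on four 16-bit lanes and writes all ones (FFFFh) or all zeros (0000h) per lane. The names are illustrative assumptions.

```c
#include <stdint.h>

typedef struct { int16_t  w[4]; } packed_i16;   /* signed inputs     */
typedef struct { uint16_t w[4]; } packed_mask;  /* per-lane bit mask */

/* Per-lane signed compare: FFFFh where a > b, 0000h otherwise. */
static packed_mask packed_cmpgt_i16(packed_i16 a, packed_i16 b)
{
    packed_mask m;
    for (int i = 0; i < 4; i++)
        m.w[i] = (a.w[i] > b.w[i]) ? 0xFFFFu : 0x0000u;
    return m;
}
```

Feeding in the values from the Parallel Compares table (23, 45, 16, 34 against 31, 7, 16, 67) yields the mask 0000h, FFFFh, 0000h, 0000h.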





## **Conditional Selection**
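
One common way to use a compare mask is a branch-free select built from the and, andnot, and or operations: take lanes of `a` where the mask is all ones and lanes of `b` elsewhere. The C model below is a sketch under that assumption; the names are hypothetical, and the slide's own figure may have shown a different arrangement.

```c
#include <stdint.h>

typedef struct { uint16_t w[4]; } packed_u16;

/* Per lane: result = (a AND mask) OR (b AND NOT mask). */
static packed_u16 packed_select(packed_u16 mask, packed_u16 a, packed_u16 b)
{
    packed_u16 r;
    for (int i = 0; i < 4; i++) {
        uint16_t keep_a = (uint16_t)(a.w[i] & mask.w[i]);
        uint16_t keep_b = (uint16_t)(b.w[i] & (uint16_t)~mask.w[i]);
        r.w[i] = (uint16_t)(keep_a | keep_b);
    }
    return r;
}
```

Because each mask lane is either FFFFh or 0000h, exactly one of the two terms survives per lane, so no conditional jump is needed.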




### **PMADD**

pmaddwd MM1, MM4

Packed multiply and add: 4 words to 2 doublewords
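
A scalar C model of the operation may help: four signed 16-bit words are multiplied pairwise into 32-bit products, and adjacent products are summed into two doublewords. The type and function names are assumptions for illustration only.

```c
#include <stdint.h>

typedef struct { int16_t w[4]; } packed_i16;  /* 4 signed words       */
typedef struct { int32_t d[2]; } packed_i32;  /* 2 signed doublewords */

/* d[0] = a0*b0 + a1*b1, d[1] = a2*b2 + a3*b3 */
static packed_i32 packed_madd_i16(packed_i16 a, packed_i16 b)
{
    packed_i32 r;
    for (int i = 0; i < 2; i++) {
        /* Sum in 64 bits so the C arithmetic itself cannot overflow. */
        int64_t sum = (int64_t)a.w[2 * i]     * b.w[2 * i]
                    + (int64_t)a.w[2 * i + 1] * b.w[2 * i + 1];
        /* Only the all-8000h input pair exceeds 32 bits; truncating the
           sum relies on the usual two's-complement conversion. */
        r.d[i] = (int32_t)(uint32_t)sum;
    }
    return r;
}
```

This is the inner step of a dot product, which is why it is useful for filters and transforms: for words (1, 2, 3, 4) and (5, 6, 7, 8) the two results are 17 and 53.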







## MMX<sup>™</sup> Technology Code Example

Inverse Discrete Cosine Transform (IDCT) Scalar vs. MMX™ Technology

Used in Video compression/decompression standards\*

## Basic Operation: A butterfly



$$O_0 = I_0 + I_1$$

$$O_1 = I_0 - I_1$$

| Scalar         |                       | MMX Technology   |                                    |
|----------------|-----------------------|------------------|------------------------------------|
| mov eax, [edi] | /* load 1st value */  | movq mm4, [edi]  | /* load 1st 4 values */            |
| mov ebx, [edx] | /* load 2nd value */  | movq mm7, [edx]  | /* load 2nd 4 values */            |
| mov esi, eax   | /* copy 1st value */  | movq mm0, mm4    | /* copy 1st 4 values */            |
| sub eax, ebx   | /* O1 = I0 - I1 */    | psubw mm4, mm7   | /* O1[0-3] = I0[0-3] - I1[0-3] */  |
| add esi, ebx   | /* O0 = I0 + I1 */    | paddw mm0, mm7   | /* O0[0-3] = I0[0-3] + I1[0-3] */  |

Same Operation, Same Number of Instructions: MMX Technology 4X Faster
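
The packed butterfly can be modeled in C as four independent 16-bit add/subtract pairs, which is what one packed add and one packed subtract compute in a single pass. This is a sketch with hypothetical names; it uses wraparound (non-saturating) arithmetic, which is an assumption about which add/subtract variants the slide intends.

```c
#include <stdint.h>

typedef struct { int16_t w[4]; } packed_i16;

/* Four butterflies at once: o0[k] = i0[k] + i1[k], o1[k] = i0[k] - i1[k]. */
static void butterfly4(packed_i16 i0, packed_i16 i1,
                       packed_i16 *o0, packed_i16 *o1)
{
    for (int k = 0; k < 4; k++) {
        /* Compute modulo 2^16 in unsigned arithmetic, then reinterpret. */
        uint16_t sum  = (uint16_t)((uint16_t)i0.w[k] + (uint16_t)i1.w[k]);
        uint16_t diff = (uint16_t)((uint16_t)i0.w[k] - (uint16_t)i1.w[k]);
        o0->w[k] = (int16_t)sum;
        o1->w[k] = (int16_t)diff;
    }
}
```

The loop body runs four times in this scalar model; in the packed version the same five instructions cover all four lanes, which is where the 4X figure for this kernel comes from.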



## MMX<sup>TM</sup> Technology Performance





<sup>\*</sup> Measured: Components of Intel's Media Benchmark

<sup>\*\*</sup> Estimated. Based on inner loops and algorithm analysis

### Conclusions

- Implementation shows full compatibility with existing OS and applications
- Simple definition, clean implementation
- Performance improvement of 1.5X to 5X for multimedia applications