# **A MIMD Multi Threaded Processor**

#### Falk Lesser

V. Angelov, J. de Cuveland, V. Lindenstruth, C. Reichling, R. Schneider, M.W. Schulz

Kirchhoff Institute for Physics, University of Heidelberg, Germany

- Phone: +49 6221 54 4304
- Email: ti@kip.uni-heidelberg.de
- Email (speaker): lesser@kip.uni-heidelberg.de
- WWW: http://www.ti.uni-hd.de



#### Outline

- Introduction
- Application for the MIMD processor
- Electronics environment
- The MIMD processor architecture
- Summary

# **High Energy Physics**

- Study of the fundamental constituents of matter and the forces between them
  - quarks, electrons, neutrinos, photons, Z<sup>0</sup>, W<sup>±</sup>, etc.
- Higher and higher energies are required to delve deeper and deeper into matter
  - i.e. larger, more powerful, more expensive accelerators
  - No longer affordable by individual countries





### **Large Hadron Collider & The Alps**



# **Nucleons In Collision**



- 8000 collisions per second
- Each interaction generates more than 24 000 particles in the acceptance of the detector
- Task: find one specific particle pair among 16 000 charged particles within 6 µs
- 6 µs to:
  - digitize 1.2 million data channels @ 10 MHz/10 Bit
  - process 29 Mbytes
  - form global decision

#### Main goal is to create the quark-gluon plasma, a state of matter existing during the first few microseconds after the Big Bang

# **A Particle Detector named ALICE**



- ITS: 2.62 million channels (1 Bit)
- TPC: 570 000 channels (10 Bit)
- TRD: 1.2 million channels (10 Bit)
  - 1.425 million ADCs
  - Digitization rate: 10 MHz/10 Bit
  - Peak data rate: 17.8 TB/sec
  - 75 000 MIMD processors
  - Computing time of 6 µs
  - Target clock rate: 120 MHz
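The peak data rate quoted for the TRD follows from the ADC count and the digitization rate. The short check below is only a back-of-the-envelope sketch: it reads the ADC count as 1.425 million (decimal comma) and assumes decimal terabytes (10^12 bytes); the function and constant names are illustrative, not from the slides.

```python
# Back-of-the-envelope check of the TRD peak data rate.
# Assumptions (not stated explicitly on the slide): the ADC count is
# 1.425 million, and 1 TB = 1e12 bytes.

ADC_COUNT = 1_425_000         # number of ADCs
SAMPLE_RATE_HZ = 10_000_000   # 10 MHz digitization rate
BITS_PER_SAMPLE = 10          # 10 bit per sample


def peak_data_rate_bytes_per_s(adcs: int, rate_hz: int, bits: int) -> float:
    """Raw data rate in bytes per second if every ADC samples continuously."""
    return adcs * rate_hz * bits / 8


rate = peak_data_rate_bytes_per_s(ADC_COUNT, SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
print(f"{rate / 1e12:.1f} TB/s")
```

Under these assumptions the result rounds to the 17.8 TB/sec given on the slide.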

Measures particle trajectories, momentum and provides particle identification

Magnet (0.4 Tesla)

## **TRD Electronics Overview**



F. Lesser, www.ti.uni-hd.de

# **Tracklet Fit Concept**



# **TRD Trigger Timing and Data Flow**



- Highly I/O bound: 17.8 TB/sec
- Very tight time budget: 1.8 + 1.5 µs

### **Next Step: Compute and Combine**



## **Computer Architecture Trends**



Culler, Patterson, UC Berkeley

# Why MIMD with Shared Memory?

- 1.5 µs computing time @ 120 MHz  $\Rightarrow$  180 clock cycles
- $\sim$ 120 clock cycles required to finish one task
- Four tasks have to be done  $\Rightarrow \sim 480$  cycles needed
- The processor must execute up to four tasks on different data objects
- Arithmetic operations should be executed in one clock cycle
- Data has to be shared without overhead  $\Rightarrow$ 
  - Quad Ported SRAM (QPM)
  - Global Register File (GRF)
- Four tasks  $\Rightarrow$  Four CPUs
- Same code  $\Rightarrow$  Quad port instruction memory saves chip area

- MIMD
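The cycle budget argument above can be written out as a few lines of arithmetic. This is a minimal sketch, assuming that a task is never split across CPUs and that tasks on one CPU run back to back; the variable names are illustrative.

```python
import math

# Figures taken from the slide.
CLOCK_MHZ = 120          # target clock rate
COMPUTE_TIME_NS = 1500   # 1.5 µs computing time
CYCLES_PER_TASK = 120    # approximate cycles to finish one task
NUM_TASKS = 4            # tasks to be executed per event

# Cycles available within the computing time (integer arithmetic).
cycle_budget = CLOCK_MHZ * COMPUTE_TIME_NS // 1000

# Cycles a single CPU would need to run all tasks sequentially.
cycles_needed = CYCLES_PER_TASK * NUM_TASKS

# Assumption: tasks are not split, so one CPU fits only whole tasks.
tasks_per_cpu = cycle_budget // CYCLES_PER_TASK
cpus_needed = math.ceil(NUM_TASKS / tasks_per_cpu)

print(f"budget: {cycle_budget} cycles, sequential need: {cycles_needed} cycles")
print(f"whole tasks per CPU: {tasks_per_cpu}, CPUs needed: {cpus_needed}")
```

With 180 cycles available and about 120 cycles per task, only one whole task fits per CPU, which is consistent with the slide's conclusion that four tasks call for four CPUs.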

## **Generic Processor Architecture**



- To execute up to four tasks in real time, four CPUs are needed
- Independent CPUs can't share data or program

# **MIMD Architecture**




# **Some Features**

- 16 Bit data word
- 16 private registers per node
- 16 global registers common to all nodes
- 24 Kbytes quad port memory

# **Instruction Set and Format**

- 24 Bit fixed length instruction word
- 70 instructions in total
  - 22 ALU instructions
  - 26 branch instructions
  - 3 instructions for synchronization
  - 14 Load/Store instructions
  - 4 instructions to handle interrupts

Instruction formats (bit 23 is the most significant bit):

| Bits 23-17 | Lower fields (bit range) |
|------------|--------------------------|
| Opcode | Source 1 (16-11), Source 2 (10-5), Destination (4-0) |
| Opcode | Immediate (16-11), Source 2 (10-5), Destination (4-0) |
| Opcode | Immediate (16-5), Destination (4-0) |
| Opcode | Source 1 (16-11), Immediate (10-0) |
| Opcode | Immediate (16-4), Branch (3-0) |
| Opcode | Source 1 (16-11), Branch (3-0) |
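To illustrate how a fixed-length 24 bit instruction word splits into fields, the sketch below packs and unpacks the three-operand format. It assumes the layout reads as opcode in bits 23-17, source 1 in bits 16-11, source 2 in bits 10-5 and destination in bits 4-0; the names `RegInstr`, `encode_reg_format` and `decode_reg_format` are hypothetical, and the opcode value used is arbitrary, not a real instruction of this processor.

```python
from typing import NamedTuple


class RegInstr(NamedTuple):
    """Fields of the assumed three-operand instruction format."""
    opcode: int  # bits 23-17 (7 bits)
    src1: int    # bits 16-11 (6 bits)
    src2: int    # bits 10-5  (6 bits)
    dest: int    # bits 4-0   (5 bits)


def encode_reg_format(opcode: int, src1: int, src2: int, dest: int) -> int:
    """Pack the four fields into one 24 bit instruction word."""
    for value, width in ((opcode, 7), (src1, 6), (src2, 6), (dest, 5)):
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit its bit width")
    return (opcode << 17) | (src1 << 11) | (src2 << 5) | dest


def decode_reg_format(word: int) -> RegInstr:
    """Split a 24 bit instruction word into its four fields."""
    if not 0 <= word < (1 << 24):
        raise ValueError("instruction word must fit in 24 bits")
    return RegInstr(
        opcode=(word >> 17) & 0x7F,
        src1=(word >> 11) & 0x3F,
        src2=(word >> 5) & 0x3F,
        dest=word & 0x1F,
    )


# Arbitrary example values, chosen only to exercise the round trip.
word = encode_reg_format(opcode=0x12, src1=3, src2=7, dest=5)
print(f"{word:06x}", decode_reg_format(word))
```

The field widths add up to 7 + 6 + 6 + 5 = 24 bits, and a 7 bit opcode leaves room for the 70 instructions listed above.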


# **Quad Port Memory**






# A Closer Look Inside the MPM

- Full custom design used for the quad port 16 bit data memory and the 24 bit instruction memory
- Delivers/receives data to/from 4 CPUs simultaneously
- Max. access time is about 2 ns (0.18 µm)
- Needed access time: 6 ns
- Organized in blocks of 64 lines
- Line width is parametrizable

Cell schematic labels: precharge, bitline1-4, not bitline1-4, word lines, sense amplifiers, Dout1 to Dout4.

Four blocks of the MPM with additional test logic in 0.35 µm process

# **Synchronization**

- Three instructions for synchronization
  - SEM sets the synchronization mask
  - SYN suspends the PC
  - SYT copies the synchronization register
- Implementation of flexible synchronization patterns
- Synchronization implemented as a side effect of access to the GRF
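The slide does not spell out the exact semantics of the three instructions, so the following is a purely illustrative toy model, not the processor's actual behaviour. It assumes barrier-style semantics: each node holds a mask naming the nodes it waits for, `syn` marks a node as having reached its synchronization point, and a node may resume once every node in its mask has arrived. The class `SyncModel` and its methods are hypothetical names chosen to mirror SEM, SYN and SYT.

```python
class SyncModel:
    """Toy model of mask-based synchronization among a few nodes.

    Assumed semantics (not confirmed by the slides): a node that executed
    SYN stays suspended until all nodes selected by its mask have also
    executed SYN.
    """

    def __init__(self, nodes: int = 4) -> None:
        self.nodes = nodes
        self.mask = [0] * nodes  # per-node mask of nodes to wait for
        self.sync_reg = 0        # bit i set: node i has reached SYN

    def sem(self, node: int, mask: int) -> None:
        """Model of SEM: set the synchronization mask of one node."""
        self.mask[node] = mask & ((1 << self.nodes) - 1)

    def syn(self, node: int) -> None:
        """Model of SYN: the node arrives and its PC is suspended."""
        self.sync_reg |= 1 << node

    def syt(self) -> int:
        """Model of SYT: return a copy of the synchronization register."""
        return self.sync_reg

    def may_resume(self, node: int) -> bool:
        """True once every node selected by this node's mask has arrived."""
        wanted = self.mask[node]
        return (self.sync_reg & wanted) == wanted


model = SyncModel()
model.sem(0, 0b0011)   # node 0 synchronizes with nodes 0 and 1
model.sem(1, 0b0011)   # node 1 synchronizes with nodes 0 and 1
model.syn(0)
waiting = model.may_resume(0)    # node 1 has not arrived yet
model.syn(1)
released = model.may_resume(0) and model.may_resume(1)
print(waiting, released, bin(model.syt()))
```

Different masks per node give different synchronization groups, which is one way to read the "flexible synchronization patterns" mentioned above.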




# Data I/O



- Direct input from preprocessor via multi ported registers
- Private I/O to external links
- System bus for peripherals
- Arbiter to select a CPU


# **Summary**

- High Energy Physics presents very interesting challenges in computer science (high throughput, low latency, low overhead, massively parallel processing)
- Use of Multi Ported Memories (MPM) enables integration of multiple generic processor cores as a MIMD unit
- MPM as global register file allows zero overhead, asynchronous multi processor communication (semaphores, locks, etc.)
- MPM operating as buffer memory provides scalable, independent access to shared data structures
- MPM operating as buffered crossbar switch allows tight integration of I/O for network and I/O processors
- For further information contact www.ti.uni-hd.de