#### A New Distributed DSP Architecture Based on the Intel IXS for Wireless Client and Infrastructure

Ernest Tsui, Communications and Interconnect Technology Lab, Corporate Technology Group

Kumar Ganapathy, Edge Access Division, Intel Communications Group

**Contributors:** Hooman Honary, Rich Nicholls, Tony Chun, and Lee Snyder

HOT CHIPS 14 Session 7: Digital Signal Processors



20 August 2002

## Outline

- Vision for Wireless Networks: ubiquitous
- Anticipated Issues: plethora of "standards"
- Future Wireless Requirements: "soft", with intelligence to increase capacity
- Architectural Objectives: flexible and low power
- How will we go about it? Distributed at the "right granularity"
- Distributed Architectural Summary: based on power, size, and wireless protocols, we can derive a "good" (near optimal?) distributed architecture
- Comparison to other Wireless DSP research: flexible but within 2x of the Berkeley Wireless Research Center's Pleiades architecture
- Summary: infrastructure architecture is near-optimum in granularity and power
- Next Steps: client architecture next



## Vision for Wireless Networks

- Ubiquitous Internet Connections for all Mobile Client Devices
  - Handhelds, PDAs, Tablet PCs, and Laptops
  - Always-on
- New Paradigm for Wireless Basestations
  - Proliferation of basestations due to lack of spectrum
  - Agility across Multiple Bands
  - Multi-Network (WLAN, WWAN)



## **Anticipated Future Issues**

#### Wireless Protocol Plethora

- PAN, WLAN, and WAN
  - PAN: Bluetooth (UWB, Wireless USB2)
  - WLAN (4 protocols): 802.11b/a (11g, Hiperlan II)
  - WAN (9 protocols):
    - 2G: IS-95, GSM
    - 2.5G: GPRS/EGPRS, cdma2000
    - 3G: WCDMA (FDD, TDD, SC), CDMA 1xEV-DV



## Wireless Requirements Summary

- Soft Radios at Basestations (deployed initially)
  - Low Power (<1 W) but highly flexible
  - Large no. of channels per core
  - Scaleable
- Reconfigurable Client Radios (deployed later)
  - Seamless Client Roaming
    - Two Concurrent Wireless Protocols
    - Selected 802.11a and WCDMA as the most intensive protocols
  - Variable User Environments require "adaptive" resource allocation
  - Adaptive to Broadband AFE distortions
  - Very Low Power (<< 1 W)
    - Digital Baseband is < 10% of total PHY power
  - Reconfigurable to allow Si Re-use
  - Scaleable



#### **802.11a Signal Processing Flow Example**



#### **802.11a Initial Acquisition Flow**



**Hooman Honary** 

For new packet communications schemes, significant processing takes place during the very short preamble intervals.




**Tony Chun and Hooman Honary** 

The high data rates in 3G result in multi-code, multi-antenna, and multi-despreader (finger) processing requirements.


## Computational Mix for Wireless Protocols



- Misc. DSP ~ 10 separate signal processing threads



# How will we go about it?




## **Present Status of Soft Radios**

- Prior Infrastructure Approaches
  - DSP + ASIC
    - Inflexible ASIC and Costly DSP
  - DSP + Closely Coupled Accelerators
    - Increased Power and Costly DSP
  - Reconfigurable
    - Hard to Program
    - Costly
    - High Power
    - Granularity problem has not been completely solved
- Need Evolved Architecture



## **Architectural Objectives**

- Client:
  - 2-3x Power/Size of Dedicated Hardware for the most intensive protocol as a goal
  - Related to no. of protocols possibly in the client device
- Basestation:
  - 5-10x Power/Size of Dedicated Hardware for the most intensive protocol as a goal
  - Related to no. of protocols possibly in the infrastructure device



## **General Architectural Issues**

- Low power requires a highly distributed architecture
  - Lowering the supply voltage reduces dynamic power quadratically
  - Lowering the clock frequency reduces dynamic power linearly
  - Large size penalties associated with distributed elements must be avoided
- What is the low power interconnect strategy?
- How do we simplify the distributed processor programming problem?
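The voltage and frequency bullets above follow from the standard CMOS dynamic power relation, P_dyn ~ a x C x Vdd^2 x F_clk. The sketch below is a minimal illustration of why spreading work over several slower units can save power. The function names and the 0.6x supply scaling are hypothetical and are not figures from the slides.

```python
def dynamic_power_ratio(v_scale: float, f_scale: float) -> float:
    """Relative dynamic power after scaling supply voltage and clock.

    Assumes P_dyn ~ a * C * Vdd^2 * F_clk with the activity factor and
    switched capacitance held constant.
    """
    return (v_scale ** 2) * f_scale


def parallel_power_ratio(n_units: int, v_scale: float) -> float:
    """Relative power when the same work is spread over n_units slower units.

    Each unit runs at 1/n_units of the original clock, so total throughput
    is unchanged. The supply scaling achievable at the lower clock (v_scale)
    is a hypothetical input, not a value taken from the slides.
    """
    per_unit = dynamic_power_ratio(v_scale, 1.0 / n_units)
    return n_units * per_unit


if __name__ == "__main__":
    print(dynamic_power_ratio(0.5, 1.0))  # halve Vdd only
    print(dynamic_power_ratio(1.0, 0.5))  # halve F_clk only
    print(parallel_power_ratio(4, 0.6))   # four units at a hypothetical 0.6x Vdd
```

Under this simple model the frequency reduction alone is cancelled by the extra units, so the saving comes entirely from the lower supply voltage that the slower clock permits. Leakage, interconnect, and area overheads are ignored here.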



## **Architecture Approach**

- Investigate Homogeneous Processing Elements (PE)
  - Easy to Scale and to Program for Basestations
  - Heterogeneous better for Client

#### Interconnect with Nearest Neighbor Mesh

- Eliminates high-speed (and high-power) buses [J. Rabaey, Silicon Architectures for Wireless, Hot Chips 2001 Tutorial]
- PHY connections are 95% nearest neighbor
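As a rough illustration of nearest-neighbor mesh connectivity, the sketch below enumerates the neighbors of a processing element on a rectangular grid. The slides do not specify the topology in detail, so the 4-connected grid without wraparound, and both function names, are assumptions.

```python
def mesh_neighbors(row: int, col: int, rows: int, cols: int) -> list:
    """Return the in-bounds north, south, west, and east neighbors of a PE.

    Assumes a rows x cols grid with 4-way connectivity and no wraparound.
    """
    candidates = [
        (row - 1, col),  # north
        (row + 1, col),  # south
        (row, col - 1),  # west
        (row, col + 1),  # east
    ]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def is_nearest_neighbor(src: tuple, dst: tuple) -> bool:
    """True when two PEs are one hop apart (Manhattan distance of 1)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1]) == 1


if __name__ == "__main__":
    print(mesh_neighbors(1, 1, 3, 3))  # interior PE
    print(mesh_neighbors(0, 0, 3, 3))  # corner PE
```

A connection for which `is_nearest_neighbor` is true needs only a short local link; anything else would have to be routed through intermediate elements or a longer wire.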

#### Number of Distributed Processing Elements

- Driven by:
  - Computational Load
  - Size and Power Constraints
  - Feature parameters, e.g., Average Load Capacitance, Vdd, etc.

#### Type of Element

- General Purpose DSP combined with:
  - Acceleration of "Standard Operations" with the right granularity
- S/W programming via High Level Language
  - Explicitly indicates parallelism and connections

#### **System Architecture**



# Does a Good (near optimal) PE Solution Exist?


## **Macro-architectural Constraints**

- First, must meet Power, Size, and Computational Load constraints
  - Computational Load = $R_c$ (ops/sec)
    - $N_{op}$ = No. of parallel significant operations (multiplies, etc.) in one cycle [R. Brodersen, ISSCC'02]
    - $F_{clk}$ = Clock frequency
    - $N_{op} \times F_{clk} > R_c$
  - Power Constraint = $P_o$ (mW)
    - Power (dynamic, leakage ($P_{leak}$), short circuit ($P_{sc}$)) < $P_o$
  - Size Constraint = $A_c$ (mm²)
    - $N_{op} \times A_{op} < A_c$
    - $A_{op}$ = Average area of a significant computational unit (e.g., multiplier-memory-address-decoder, etc.) (mm²)
    - $A_{op}$ ~ Granularity Factor
  - Constraints on $F_{clk}$
    - $R_c / N_{op} < F_{clk}$
    - $R_c \times A_{op} / A_c < F_{clk}$
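The load and size constraints above can be checked mechanically. The sketch below is one way to do so, using the slide's symbols as parameter names. The function names and the numeric values in the demo are hypothetical and chosen only for illustration.

```python
def min_clock_hz(r_c_ops: float, a_op_mm2: float, a_c_mm2: float) -> float:
    """Lower bound on F_clk implied by R_c * A_op / A_c < F_clk."""
    return r_c_ops * a_op_mm2 / a_c_mm2


def meets_load_and_size(n_op: int, f_clk_hz: float, r_c_ops: float,
                        a_op_mm2: float, a_c_mm2: float) -> bool:
    """Check N_op * F_clk > R_c (load) and N_op * A_op < A_c (size)."""
    load_ok = n_op * f_clk_hz > r_c_ops
    size_ok = n_op * a_op_mm2 < a_c_mm2
    return load_ok and size_ok


if __name__ == "__main__":
    # Hypothetical design point: 16 GOPS load, 0.5 mm^2 per unit, 32 mm^2 budget.
    r_c, a_op, a_c = 16e9, 0.5, 32.0
    print(min_clock_hz(r_c, a_op, a_c))
    print(meets_load_and_size(48, 400e6, r_c, a_op, a_c))
    print(meets_load_and_size(48, 300e6, r_c, a_op, a_c))
```

The power constraint is left out of this check because it needs the technology parameters introduced with the clock rate bounds.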


### **Clock Rate Bounds**

#### - F<sub>clk</sub> is upper bounded by power constraints

- $a \times C_{sw} \times V_{dd}^2 \times F_{clk} + P_{leak} < P_o/(b \times A_c)$
  - where $P_{leak}$ is the average leakage power density in mW/mm²
  - $C_{sw}$ is the average switching (load) capacitance per mm²
  - $a$ is the activity factor
  - $b$ is the average active-area fraction (incl. datapath, cache, cache memory bus, etc. and excl. L2 memory, etc.)
  - $b$ varies from ~10% for microprocessors to ~80% for dedicated hardware and is also a function of clock gating strategies
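Solving the power constraint for the clock rate makes the upper bound explicit. The rearrangement below is purely algebraic and assumes the product of activity factor, switching capacitance, and squared supply voltage is positive.

```latex
\begin{align*}
a \, C_{sw} \, V_{dd}^2 \, F_{clk} + P_{leak} &< \frac{P_o}{b \, A_c} \\
a \, C_{sw} \, V_{dd}^2 \, F_{clk} &< \frac{P_o}{b \, A_c} - P_{leak} \\
F_{clk} &< \frac{P_o / (b \, A_c) - P_{leak}}{a \, C_{sw} \, V_{dd}^2}
\end{align*}
```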

#### - F<sub>clk</sub> is lower bounded by computational and area constraints

- $R_c \times A_{op}/A_c < F_{clk} < (P_o/(b \times A_c) - P_{leak})/(a \times C_{sw} \times V_{dd}^2)$
- Key Issues:
  - Find the $F_{clk}$ that meets upper and lower bounds
  - Derive the $A_{op}$ and $N_{op}$
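The two-sided bound on the clock rate can be evaluated numerically. In the sketch below, the computational load, unit area, and power budget follow the 90 nm example parameters given later in the deck. The total area, active-area fraction, leakage density, activity factor, switching capacitance, and supply voltage are hypothetical placeholders, since the slides give no values for them.

```python
def clock_window_hz(r_c, a_op, a_c, p_o, b, p_leak, a, c_sw, v_dd):
    """Return (lower, upper) bounds on F_clk in Hz.

    Units: r_c in ops/s; a_op and a_c in mm^2; p_o in W; p_leak in W/mm^2;
    c_sw in F/mm^2; v_dd in V; a and b are dimensionless fractions.
    """
    lower = r_c * a_op / a_c
    upper = (p_o / (b * a_c) - p_leak) / (a * c_sw * v_dd ** 2)
    return lower, upper


# r_c, a_op, and p_o follow the deck's 90 nm example; every other value is a
# hypothetical placeholder chosen only to exercise the formula.
EXAMPLE = dict(
    r_c=20e9, a_op=0.6, p_o=0.75,
    a_c=40.0, b=0.5, p_leak=0.005, a=0.2, c_sw=1e-10, v_dd=1.0,
)

if __name__ == "__main__":
    lower, upper = clock_window_hz(**EXAMPLE)
    print(f"F_clk window: {lower / 1e6:.0f} MHz to {upper / 1e6:.0f} MHz")
    if lower < upper:
        print("A feasible clock rate exists")
    else:
        print("No clock rate satisfies both bounds")
```

If the lower bound exceeds the upper bound, no clock rate satisfies both constraints and the unit area, area budget, or power budget has to change.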





## **Reconfigurable Power Trend Summary**

#### – There is an optimum $F_{clk}$ for a fixed $A_{op}$

- (Recall that  $A_{op}$  is the fundamental processing size)
- The optimum meets Size and Computational requirements and minimizes power for the above
- Higher  $F_{clk}$  increases power and lower  $F_{clk}$  increases area and interconnect power
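The area side of this trade-off follows directly from the load constraint: a lower clock needs more parallel operations, and so more area, to sustain the same load. The sketch below sweeps the clock rate for a fixed unit area. The load and unit area follow the deck's 90 nm example, while the helper name and the sweep points are assumptions.

```python
import math


def pe_budget(f_clk_hz: float, r_c_ops: float, a_op_mm2: float):
    """Return (n_op, area_mm2) needed to sustain r_c_ops at f_clk_hz.

    n_op is the smallest integer with n_op * f_clk_hz >= r_c_ops.
    """
    n_op = math.ceil(r_c_ops / f_clk_hz)
    return n_op, n_op * a_op_mm2


if __name__ == "__main__":
    for f_mhz in (100, 200, 400, 800):
        n_op, area = pe_budget(f_mhz * 1e6, 20e9, 0.6)
        print(f"{f_mhz:4d} MHz -> N_op = {n_op:3d}, area = {area:6.1f} mm^2")
```

The power side is not modeled here. Dynamic power grows with the clock rate and with the supply voltage needed to reach it, which is what pushes the optimum back down.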

#### – Is there a similar optimum as A<sub>op</sub> is Varied?

- As A<sub>op</sub> decreases interconnect Power increases exponentially
  - Simpler elements must be connected in a more complex manner to retain flexibility
- As A<sub>op</sub> increases the voltage requirement (and Power) increases
  - More complex element requires time-multiplexing

#### – Thus, is there a globally "good" design?

- Conjecture:
  - Determine the minimum $A_{op}$ (for the flexibility desired) and find the optimum $F_{clk}$



## Example of "Good" Architecture Parameters in the optimum area

- For 90 nm:
  - N<sub>op</sub> (no. of parallel significant operations) ~ 50
  - A<sub>op</sub> ~ 0.6 mm<sup>2</sup>
    - Is this an optimum granularity A<sub>op</sub>?
  - F<sub>clk</sub> ~ 400 MHz
  - P<sub>o</sub> ~ 750 mW
  - R<sub>c</sub> ~ 20 GOPs
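These figures are mutually consistent with the macro-architectural constraints, as the short check below shows. The energy-per-operation figure is derived here on the assumption that the whole power budget is spent delivering the full computational load. It is not a number quoted in the slides, and all inputs are approximate.

```python
N_OP = 50          # parallel significant operations
A_OP_MM2 = 0.6     # area per operation unit, mm^2
F_CLK_HZ = 400e6   # clock rate, Hz
P_O_W = 0.750      # power budget, W
R_C_OPS = 20e9     # computational load, ops/s


def summarize() -> dict:
    """Derive throughput, datapath area, and energy per op from the example."""
    return {
        "throughput_ops": N_OP * F_CLK_HZ,            # N_op * F_clk
        "datapath_area_mm2": N_OP * A_OP_MM2,         # N_op * A_op
        "energy_per_op_pj": P_O_W / R_C_OPS * 1e12,   # assumed: P_o / R_c
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:g}")
```

The product of the parallel operation count and the clock rate matches the stated load of about 20 GOPS, and the implied datapath area is about 30 mm².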



# Key Computing Element IXS Core




#### **Architecture Summary**

- IXS Processor
  - Octal MAC units
  - RISC-tightly coupled
- Acceleration H/W
  - Viterbi/Turbo
  - Correlation, De-spreading, etc.
  - Filter
- Parameters within the N<sub>op</sub> Range (50)
  - 5 PEs x 9 MACs = 45 MACs
  - 32 8-bit adders per PE
- Mesh-Connected to Surrounding Processors (5 PEs total)
- Do we have the optimal A<sub>op</sub>?
  - Lower A<sub>op</sub> will start to increase interconnect Power

# How does the IXS PE Compare against Dedicated Hardware?


#### Power and Area Efficiency of IXS PE vs Dedicated H/W for WLAN Benchmark

Still 5-7x Dedicated H/W



# How do we compare against other Reconfigurable Approaches?



#### How Does our Architecture Compare? Multi-User Detector Benchmark



**BWRC and Lee Snyder** 





## Summary

- Homogeneous Mesh-Connected Array of IXS Processing Elements for Infrastructure
  - Low power/size (5-7x dedicated h/w)
  - Flexibility where it's needed
  - Scaleability
  - For given size/power and feature size constraints a "good" solution can be found
  - Key Processing element
    - Minimum Memory
    - "Maximum-Datapath" Units
- Next Steps:
  - "What is the optimal A<sub>op</sub> Size?"
  - "What is the right Arch. for the Client?"

