





# GeForce 8800 Architecture Overview

## **Graphics pipelines for the last 20 years** *Dedicated hardware per processing stage*

- Vertex: fixed transform and lighting, evolved to programmable vertex shading
- Triangle: triangle, point, and line setup
- Pixel: flat shading and texturing, evolved to programmable pixel shading
- ROP: blending, Z-buffering, antialiasing

Wider and faster over the years. DRAM bandwidth CAGR: 1.4x raw, 2.0x with compression.

Hot Chips 2007: The NVIDIA GeForce 8800 GPU

© NVIDIA Corporation 2007




## **Unified Shader Processor Architecture**

GeForce 8800 has a unified shader processor architecture

- All shader stages use the same instruction set
- All shader stages (vertex, geometry, pixel) execute on the same units
- Permits better sharing of expensive hardware shader resources
  - Building dedicated units often results in underutilization due to the workload of the application
    - When one unit becomes the bottleneck, other pipelined units are less efficient
  - Dynamic and static thread/resource load balancing
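The load-balancing argument above can be made concrete with a toy throughput model. This is only a sketch: it assumes work divides perfectly across units and that a frame finishes when its slowest stage finishes. The workloads are hypothetical, and the 8/24 split simply echoes the 7900GTX unit counts quoted later in this deck.

```python
def dedicated_time(vertex_work, pixel_work, vertex_units, pixel_units):
    """Frame time when each stage owns fixed units: the slower stage sets the pace."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def unified_time(vertex_work, pixel_work, total_units):
    """Frame time when every unit can run either kind of work."""
    return (vertex_work + pixel_work) / total_units


# Hypothetical workloads, in arbitrary work units per frame.
workloads = {"vertex-heavy": (900, 300), "pixel-heavy": (100, 1100)}

for name, (v, p) in workloads.items():
    d = dedicated_time(v, p, vertex_units=8, pixel_units=24)
    u = unified_time(v, p, total_units=32)
    print(f"{name}: dedicated={d:.1f} unified={u:.1f}")
```

In this idealized model the unified pool is never slower than the fixed split, and the gap widens as the workload mix drifts away from the ratio the dedicated design was built for.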






# **Shader Model Progression**



|                     | DX8 SM1.x | DX9 SM2 | DX9 SM3    | DX10 SM4 |
|---------------------|-----------|---------|------------|----------|
| Vertex Instructions | 128       | 256     | 512        | 64k      |
| Pixel Instructions  | 4+8       | 32+64   | 512        | 64k      |
| Vertex Constants    | 96        | 256     | 256        | 16x4096  |
| Pixel Constants     | 8         | 32      | 224        |          |
| Vertex Temps        | 16        | 16      | 16         | 4096     |
| Pixel Temps         | 2         | 12      | 32         |          |
| Vertex Inputs       | 16        | 16      | 16         | 16       |
| Pixel Inputs        | 4+2       | 8+2     | 10         | 32       |
| Render Targets      | 1         | 4       | 4          | 8        |
| Vertex Textures     | n/a       | n/a     | 4          | 128      |
| Pixel Textures      | 8         | 16      | 16         | 128      |
| Tex Size            |           |         | 2k x 2k    | 8k x 8k  |
| Int Ops             |           |         |            | Yes      |
| Load Op             |           |         |            | Yes      |
| Derivatives         |           |         |            | Yes      |
| Vertex Flow Control | n/a       | Static  | Static/Dyn | Dynamic  |
| Pixel Flow Control  | n/a       | n/a     | Static/Dyn | Dynamic  |







## **Streaming Processor Array**



- SPA contains 8 Texture Processor Clusters (TPC)
- Each TPC contains 2 Streaming Multiprocessors (SM) and 1 texture pipe (TEX)
- SM executes shader stages

- Communicates with the Raster Operation Processors (ROP) which have Frame Buffer (FB) memory access
- There are 6 ROPs and 6 64b FB partitions
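The unit counts above multiply out to the chip-level totals. A small sketch of that arithmetic follows; the constant and function names are illustrative, and only the numbers come from the slide.

```python
TPC_COUNT = 8            # Texture Processor Clusters in the SPA
SMS_PER_TPC = 2          # Streaming Multiprocessors per TPC
TEX_PER_TPC = 1          # texture pipes per TPC
ROP_COUNT = 6            # one ROP per frame buffer partition
FB_PARTITION_BITS = 64   # width of each frame buffer partition


def spa_totals():
    """Derive chip-level totals from the per-cluster counts."""
    return {
        "sm_count": TPC_COUNT * SMS_PER_TPC,
        "tex_pipes": TPC_COUNT * TEX_PER_TPC,
        "memory_width_bits": ROP_COUNT * FB_PARTITION_BITS,
    }


print(spa_totals())
```

The resulting 384-bit memory width agrees with the figure given in the performance comparison.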



# **Design Goals of the SM**



Independent processor and memory pipelines

- Better memory latency hiding
- Better for GPU Compute

#### **Unified Processor**

- In a fixed-function fully-pipelined architecture, a shader stage bottleneck stalls the entire pipeline
- Unified design allows shader stage load-balancing
- Scalar ALUs instead of vector
  - Shader programs are becoming longer and more scalar, hard to be efficient with a vector architecture
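The case for scalar ALUs can be illustrated with a simple lane-utilization count. This is a toy model, not a description of the hardware: it assumes a 4-wide vector ALU spends one issue slot per operation regardless of how many components the operation uses, and the operation mix is made up.

```python
VECTOR_WIDTH = 4


def vector_utilization(op_widths):
    """Fraction of lanes doing useful work when each op occupies one 4-wide slot."""
    if not op_widths:
        return 0.0
    return sum(op_widths) / (VECTOR_WIDTH * len(op_widths))


def scalar_utilization(op_widths):
    """A scalar ALU issues one component per slot, so every slot is useful."""
    slots = sum(op_widths)
    return 1.0 if slots else 0.0


# Hypothetical shader: component counts of six consecutive operations.
ops = [4, 3, 1, 1, 2, 1]
print(vector_utilization(ops), scalar_utilization(ops))
```

For this mix, half of the vector lanes sit idle, while a scalar design keeps every issued slot busy at the cost of issuing more slots.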

#### Compilable

Shader programs should be compiler-friendly

# **SM Multithreaded Multiprocessor**





# SP Multiply-Add (MAD) Unit



- MAD unit operates on fp32 operands, produces fp32 output
- Performs all fundamental FP operations: FADD, FMUL, FMAD, FMIN, FMAX
- Performs integer ops and conversions
- Fully-pipelined, but latency is not over-optimized at the expense of area

FADD and FMUL IEEE 754 compliant

- Round-to-nearest-even and round-to-zero
- Special numbers properly handled
- Denormal inputs and outputs are flushed-to-zero
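The flush-to-zero rule can be modeled in a few lines. This is a sketch under stated assumptions: fp32 values are emulated by round-tripping Python floats through `struct`, the flush is assumed to keep the sign of the operand, and `fmul_ftz` is a hypothetical helper name rather than anything from the hardware documentation.

```python
import math
import struct

FP32_MIN_NORMAL = 2.0 ** -126  # smallest positive normal fp32 value


def to_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flush_to_zero(x):
    """Replace a denormal fp32 value with a zero of the same sign (assumed)."""
    if x != 0.0 and abs(x) < FP32_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


def fmul_ftz(a, b):
    """Model an fp32 multiply with denormal inputs and outputs flushed."""
    a = flush_to_zero(to_fp32(a))
    b = flush_to_zero(to_fp32(b))
    # The fp64 product of two fp32 values is exact, so one rounding suffices.
    return flush_to_zero(to_fp32(a * b))


print(fmul_ftz(1.5, 2.0))      # ordinary normal result
print(fmul_ftz(1e-20, 1e-20))  # product is below the normal range
print(fmul_ftz(1e-39, 1e10))   # first input is already denormal
```

The last call shows the practical effect of flushing inputs: a denormal operand behaves as zero even when the mathematical product would be a normal number.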

# **Special Function Unit (SFU)**





# **SFU: Attribute Interpolation**



Plane equation unit generates fp32 plane equation coefficients to represent all triangle attributes

- A, B, and C are fp32 interpolation parameters associated with a given triangle's attribute U

- Resulting attribute value U is fp32
- SFU must interpolate the value of each attribute per (x,y) for all pixels to be drawn:

U(x,y) = A\*x + B\*y + C

### For perspective correct interpolation:

- Interpolate 1/w, and reciprocate to form w
- Interpolate U/w
- Multiply U/w and w to form perspective-correct U
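The three steps above can be sketched directly from the plane equation. The coefficients below are hypothetical: they describe an attribute with U=0, w=1 at x=0 and U=1, w=4 at x=10, constant in y. The code uses Python floats rather than fp32.

```python
def eval_plane(coeffs, x, y):
    """Evaluate the plane equation A*x + B*y + C at pixel (x, y)."""
    a, b, c = coeffs
    return a * x + b * y + c


def perspective_correct(u_over_w_plane, inv_w_plane, x, y):
    """Interpolate 1/w and U/w linearly in screen space, then recover U."""
    inv_w = eval_plane(inv_w_plane, x, y)
    w = 1.0 / inv_w                       # reciprocate to form w
    u_over_w = eval_plane(u_over_w_plane, x, y)
    return u_over_w * w                   # perspective-correct U


# Hypothetical plane coefficients (A, B, C) for the example described above.
INV_W_PLANE = (-0.075, 0.0, 1.0)     # 1/w: 1.0 at x=0, 0.25 at x=10
U_OVER_W_PLANE = (0.025, 0.0, 0.0)   # U/w: 0.0 at x=0, 0.25 at x=10

print(perspective_correct(U_OVER_W_PLANE, INV_W_PLANE, 5, 0))
```

At the screen-space midpoint x=5 the result is about 0.2, whereas interpolating U itself linearly would give 0.5, which is the error perspective correction removes.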

# **Transcendental Function Statistics**


| Function           | Input interval | Accuracy (good bits) | ULP error | % exactly rounded | Monotonic |
|--------------------|----------------|----------------------|-----------|-------------------|-----------|
| 1/X                | [1,2)          |                      | 0.98      | 87%               | Yes       |
| 1/sqrt(X)          | [1,4)          | 23.40                | 1.52      | 78%               | Yes       |
| 2<sup>X</sup>      | [0,1)          | 22.51                | 1.41      | 74%               | Yes       |
| log<sub>2</sub> X  | [1,2)          | 22.57                | n/a       | n/a               | Yes       |
| Sin/cos            | [0,pi/2)       | 22.47                | n/a       | n/a               | No        |










## Detail of a single ROP pixel pipeline







### **Multisampling**



Store a unique color and depth value for each pixel sub-sample, but re-use *one calculated color* for all color sub-samples

#### Strengths

- Only one color value calculated per pixel per triangle
- Z and stencil evaluated precisely; interpenetrations and bulkheads correct

#### Weaknesses

- Memory footprint N times larger than 1x
- Expensive to extend to 8x quality and beyond
- Multisampling evolved from $1 \rightarrow 2 \rightarrow 4$ samples
- Beyond 4 sub-samples, storage cost increases faster than the image quality improves
  - Even more true with HDR
    - 64b and 128b per color sub-sample!
  - For the vast majority of edge pixels, 2 colors are enough
    - What matters is more detailed coverage information


### **Coverage Sampled Antialiasing (CSAA)**




- Compute and store boolean coverage at 16 sub-samples
- Compress the redundant color and depth/stencil information into the memory footprint and bandwidth of 4 or 8 multisamples

#### Performance of 4xAA with 16x quality

- Low cost per coverage sample
- Just works with existing rendering techniques
  - HDR, stencil algorithms
- Efficient use of shader and texture hardware
- Boolean, not scalar, coverage
  - No bleed-through
  - Falls back to the stored sample count quality (4x or 8x) for high-contrast Z/stencil results
    - Inter-penetrating triangles
    - Stencil shadow volumes
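Why extra coverage samples help even when only a few colors are stored can be shown with a toy edge pixel. This is an idealization and not the hardware's resolve: it assumes exactly two colors meet in the pixel and that N boolean samples can represent the covered fraction to the nearest 1/N. Both helper names are hypothetical.

```python
def quantized_coverage(true_coverage, coverage_samples):
    """Nearest coverage fraction representable with N boolean samples (idealized)."""
    return round(true_coverage * coverage_samples) / coverage_samples


def edge_color(foreground, background, coverage):
    """Blend two stored colors by the covered fraction of the pixel."""
    return tuple(f * coverage + b * (1.0 - coverage)
                 for f, b in zip(foreground, background))


true_coverage = 0.3  # hypothetical fraction of the pixel covered by the triangle
for n in (4, 16):
    c = quantized_coverage(true_coverage, n)
    color = edge_color((1.0, 1.0, 1.0), (0.0, 0.0, 0.0), c)
    print(f"{n:2d} samples: coverage={c} error={abs(c - true_coverage):.4f} color={color}")
```

With the same two stored colors, 16 coverage samples cut the coverage error from 0.05 to 0.0125 in this example, which is the kind of gain the slide attributes to detailed coverage information.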

# **Antialiasing Modes Comparison**



| AA mode                   | Quality level | Texture/shader samples | Stored color/Z samples | Coverage samples |
|---------------------------|---------------|------------------------|------------------------|------------------|
| Brute-force supersampling | 1x            | 1                      | 1                      | 1                |
| Brute-force supersampling | 4x            | 4                      | 4                      | 4                |
| Brute-force supersampling | 16x           | 16                     | 16                     | 16               |
| Multisampling             | 1x            | 1                      | 1                      | 1                |
| Multisampling             | 4x            | 1                      | 4                      | 4                |
| Multisampling             | 16x           | 1                      | 16                     | 16               |
| Coverage sampling         | 1x            | 1                      | 1                      | 1                |
| Coverage sampling         | 4x            | 1                      | 4                      | 4                |
| Coverage sampling         | 16x           | 1                      | 4                      | 16               |

Multisampling reduces shader work relative to supersampling; coverage sampling additionally reduces memory footprint and bandwidth.



## **GeForce 8800 Scalability**



Architecture was designed for scalability

- Value, mainstream, and enthusiast segments
- Number of SMs, TPCs and ROPs can be varied, allowing the right mix for different performance and cost targets
- Upward scalability is available with Scalable Link Interface (SLI), allowing multiple GPUs to be connected together


## **GeForce 8800 Performance**




|                      | 7900GTX        | 8800GTX                  |
|----------------------|----------------|--------------------------|
| Shader Model         | SM3            | SM4                      |
| Vertex Shader Units  | 8 (vector)     | unified with pixel units |
| Pixel Shader Units   | 24 (vector)    | 128 (scalar)             |
| Shader Math (GFLOPS) | 232            | 576                      |
| Texture Filter       | 12 GBilerp/sec | 38 GBilerp/sec           |
| ROP Processing       | Up to 24ppc    | Up to 192ppc             |
| Memory Width         | 256-bit        | 384-bit                  |
| Memory Bandwidth     | 51.2 GB/sec    | 104 GB/sec               |

## **GeForce 8800 Implementation**



- 681 million transistors, 470 mm<sup>2</sup>
- Manufactured in TSMC 90nm
- Multiple clock domains
- 384-bit memory interface connecting to 768 MB of DDR frame buffer memory, yielding 104 GB/sec of bandwidth (1.08 GHz memory clock)
- Typical operating power consumption of 150 W
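The quoted bandwidth follows from the interface width and memory clock. A minimal check, assuming two transfers per clock because the memory is DDR; the function name is illustrative.

```python
def memory_bandwidth_gb_per_s(bus_width_bits, clock_ghz, transfers_per_clock=2):
    """Peak bandwidth in GB/s: bytes per transfer x clock x transfers per clock."""
    return bus_width_bits / 8 * clock_ghz * transfers_per_clock


bandwidth = memory_bandwidth_gb_per_s(bus_width_bits=384, clock_ghz=1.08)
print(f"{bandwidth:.2f} GB/s")
```

The result is 103.68 GB/s, which rounds to the 104 GB/sec stated above.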



## **GeForce 8800 Summary**




- Processor-based architecture
  - Previous architectures are graphics-pipeline based
- Scalable number of processing cores
  - 576 GFLOPS just for shader execution (1.5 GHz)
  - Scalar processors instead of 4-vector
- Hardware multithreading
  - 12288 threads
  - Zero-overhead thread scheduling
- Scalable number of memory partitions
  - Supports non-power-of-2 partition count
- Adds GPU Compute
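The summary figures can be cross-checked against the unit counts given earlier in the deck (128 scalar processors, 8 TPCs with 2 SMs each). The sketch below only divides the quoted numbers; the source does not say how the per-clock operations are composed, so the code does not assume it.

```python
SP_COUNT = 128            # scalar processors, from the performance table
SM_COUNT = 8 * 2          # 8 TPCs x 2 SMs per TPC
SHADER_CLOCK_GHZ = 1.5    # shader clock quoted with the GFLOPS figure
PEAK_GFLOPS = 576
TOTAL_THREADS = 12288

# Floating-point operations each scalar processor must retire per clock
# for the quoted peak to hold.
flops_per_sp_per_clock = PEAK_GFLOPS / (SP_COUNT * SHADER_CLOCK_GHZ)

threads_per_sm = TOTAL_THREADS // SM_COUNT
threads_per_sp = TOTAL_THREADS // SP_COUNT

print(flops_per_sp_per_clock, threads_per_sm, threads_per_sp)
```

The quoted peak implies 3 floating-point operations per processor per clock, and the thread count works out to 768 threads per SM, or 96 per scalar processor.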
