# **EXILINX**ALL PROGRAMMABLE

## Dataflow Architectures for 10Gbps Line-rate Key-value-Stores

Michaela Blott, Kees Vissers - Xilinx Research

© Copyright 2013 Xilinx

## Agenda

#### Current key-value stores (KVS)

- State-of-the-art
- Bottlenecks

#### Dataflow architectures for KVS

- Why dataflow architectures
- Prototype architecture
- Results
- Limitations

## **Key-Value Stores**

Common middleware application to alleviate access bottlenecks on databases



- Most popular and most recent database contents are cached in main memory of a tier of x86 servers
- Provides the abstraction of an associative memory
  - Values are stored or retrieved by sending the associated key
  - GET(KEY) and SET(KET,VALUE)

GET(k): receive(p); k = parse(p);a = hashKey(k);v = readValue(a);new p = format(v); send(new p);

> Memcached is a commonly used open source package for KVS

## **Typical Implementations**

> Hardware:



XILINX > ALL PROGRAMMABLE.

#### **Best Published Performance Numbers**

| Platform                                        | RPS [M] | Latency [us]        | RPS/W [K] |    |
|-------------------------------------------------|---------|---------------------|-----------|----|
| Intel <sup>®</sup> Xeon <sup>®</sup> (8 cores)* | 1.34    | 200-300             | 7         | 7K |
| Intel Xeon (1 4MRPS)*                           | 3.15    | 200-300             | 11.2      |    |
| Memcached                                       | 1.8     | 12<br>200us late    |           |    |
| TilePRO (64 cores)***                           | 0.34    | 200 <del>-400</del> | 5.0       |    |
| TilePRO (4x64 cores)***                         | 1.34    | 200-400             | 5.8       |    |
| Chalamalasetti (FPGA)****                       | 0.27    | 2.4-12              | 30.04     |    |

\* WIGGINS, A., AND LANGSTON, J. Enhancing the scalability of memcached. In Intel Software Network (2012).

\*\*JOSE, J., SUBRAMONI, H., LUO, M., ZHANG, M., HUANG, J., UR RAHMAN, M. W., ISLAM, N. S., OUYANG, X., WANG, H., SUR, S., AND PANDA, D. K. Memcached design on high performance rdma capable interconnects. 2012 41st International Conference on Parallel Processing 0 (2011), 743–752.

\*\*\*\* Kevin Lim, David Meisner, Ali G. Saidi, Parthasarathy Ranganathan, and Thomas F. Wenisch. 2013. Thin servers with smart pipes: designing SoC accelerators for memcached. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 36-47.

XII INX > ALL PROGRAMMABLE.

<sup>\*\*\*</sup> BEREZECKI, M., FRACHTENBERG, E., PALECZNY, M., AND STEELE, K. Power and performance evaluation of memcached on the tilepro64 architecture. In Green Computing Conference and Workshops (IGCC), 2011 International (July 2011), pp. 1 –8.

## **Bottlenecks – TCP/IP Stack**

#### CPU intensive

| Tota | 1: | <b>4</b> 5% <b>us</b> , | 113%sy,   | 0%ni,    | 534%id,   | 08wa,     | 0%hi,   | 109%si,    | 0%st   |
|------|----|-------------------------|-----------|----------|-----------|-----------|---------|------------|--------|
| Mom  | ¢  | 5859040k t              | total 38' | 20060k u | ad 2020   | 080k free | 29350   | 94k buffer | c      |
| Cpu7 | :  | 7.7%us,                 | 15.4%sy,  | 0.0%ni,  | 73.1%id,  | 0.0%wa,   | 0.0%hi, | 3.8%si,    | 0.0%st |
| Cpu6 | :  | 0.0%us,                 | 0.0%sy,   | 0.0%ni,  | 100.0%id, | 0.0%wa,   | 0.0%hi, | 0.0%si,    | 0.0%st |
| Cpu5 | :  | 1.9%us,                 | 11.5%sy,  | 0.0%ni,  | 84.6%id,  | 0.0%wa,   | 0.0%hi, | 1.9%si,    | 0.0%st |
| Cpu4 | :  | 18.9%us,                | 56.6%sy,  | 0.0%ni,  | 13.2%id,  | 0.0%wa,   | 0.0%hi, | 11.3%si,   | 0.0%st |
| Cpu3 | :  | 3.6%us,                 | 0.0%sy,   | 0.0%ni,  | 96.4%id,  | 0.0%wa,   | 0.0%hi, | 0.0%si,    | 0.0%st |
| Cpu2 | :  | 4.9%us,                 | 6.6%sy,   | 0.0%ni,  | 86.9%id,  | 0.0%wa,   | 0.0%hi, | 1.6%si,    | 0.0%st |
| Cpu1 | :  | 5.7%us,                 | 13.2%sy,  | 0.0%ni,  | 77.4%id,  | 0.0%wa,   | 0.0%hi, | 3.8%si,    | 0.0%st |
| Cpuo | :  | I.9%us,                 | 9.6%sy,   | 0.0%n1,  | I.9%1d,   | 0.0%wa,   | 0.0%n1, | 80.5%Sl,   | 0.0%ST |

#### > Frequent interrupts\*

- Leads to high rate of instruction cache misses (up to 160 MPKI)
  - Requires much larger L1 instruction caches
- Causes poor branch predictability
  - Stalls in the superscalar pipeline architecture of standard x86

#### High latency\*

 Packets have to be DMA'ed from/to network adapter over the PCIe<sup>®</sup> bus which introduces high latency

No resource sharing between memcached and TCP/IP stack
 Close integration of network, compute and memory

\*Top (while running 4 memcached instances



© Copyright 2013 Xilinx

#### **Bottlenecks** – *Synchronization Overhead and L3 Cache*

#### > Threads stall on memory locks

- 1. Large locks effectively serialize execution
- 2. Synchronization races cause poor branch predictability
- This leads to inefficient use of superscalar pipelines
- Intel has shown that by improving the granularity of the locks, we can scale to 1.4 MRPS (from 0.2MRPS)\*
- Last level cache ineffective due to random-access nature of key-valuestores (miss rate 60% - 95%\*\*)
  - Multithreading can't effectively hide memory access latencies
  - Cause considerable power waste

=> Exploitation of instruction-level parallelism through data-flow architectures
 => Static memory access schedule eliminates memory arbitration conflict
 => Data caching is ineffective

## **Why Dataflow Architectures?**

- > Memcached is fundamentally a streaming problem
  - Data is moved from network to memory and back with little compute



- > As such, dataflow architectures, frequently used for network processing, should be well suited towards the application
  - Higher performance
  - Lower power consumption

## **Proposed System Architecture**



\*below 3% of 1 core for 10% SET operations \*limited memory access bandwidth on platform

#### XILINX > ALL PROGRAMMABLE.

## **FPGA-based Dataflow Architecture**



#### ilinx 🛛 🗧 🕻 XILINX 🄰 ALL PROGRAMMABLE..

## **FPGA-based Dataflow Architecture**



 => Exploiting instruction-level parallelism increases throughput, lowers latency and is more power efficient
 => Inherently scalable

XII INX > ALL PROGRAMMABLE.

## **Hash Table Architecture**

#### > Collision handling through parallel lookup (8-way)

- Suits the wide memory bus
- > Flexible key handling through striping



## **Hash Table Dimensions**



#### Size for hash table <400MB</p>

Key limit is 168 byte (memcached max: 250B, most use-cases <50B)</li>

> On our platform this hash table manages 23.6GB of value storage

## **System Test Setup**



## **System Test Setup**



## **Power - Test Setup & Results**



#### Test system 1: with FPGA board

#### Test system 2: without FPGA board



XII INX > ALL PROGRAMMABLE.

\*(Power sourced from: power plug meter, xpower, data sheets and power regulator readings) \*\*(UDP, binary protocol)

\*\*\*(includes FPGA and host system)

© Copyright 2013 Xilinx

## **Results - Performance**



## **First Results of Memcached Evaluation**

- > Sustained line rate processing for 10GE 13MRPS possible, at smallest packet size
  - Significant improvement over latest x86 numbers
- > Lower power
- > Combined: 36x in RPS/Watt with low variation
- > Cutting edge latency
  - microseconds instead of 100s of microseconds

| Platform             | RPS [M] | Latency [us] | RPS/W [K] |  |
|----------------------|---------|--------------|-----------|--|
| Intel Xeon (8 cores) | 1.34    | 200-300      | 7         |  |
| TilePRO (64 cores)   | 0.34    | 200-400      | 3.6       |  |
| FPGA (board only)    | 13.02   | 3.5-4.5      | 254.8     |  |
| FPGA (with host)     | 13.02   | 3.5-4.5      | 106.7     |  |



XII INX > ALL PROGRAMMABLE.

## **Code Complexity**



© Copyright 2013 Xilinx

XILINX > ALL PROGRAMMABLE...

#### Limitations Development Effort

- > Hardware design exposes a greater complexity to the user and requires therefore more engineering effort
- > HLS reduces the complexity and shortens the development effort



🗶 XILINX 🕨 ALL PROGRAMMABLE.

=> Greater performance at expense of larger development effort
 => Exploration of how HLS can reduce the cost

#### Limitations Development Effort

- > Hardware design exposes a greater complexity to the user and requires therefore more engineering effort
- > HLS reduces the complexity and shortens the development effort



## => Greater performance at expense of larger development effort => Exploration of how HLS can reduce the cost

© Copyright 2013 Xilinx

XILINX > ALL PROGRAMMABLE.

### **Other Limitations**

#### > TCP offload restricted to #sessions

=> Future investigation into high session count TOE

Limited storage capacity => SSD

Memory allocation & cache management on host CPU Limited collision handling Limited protocol support => Exploration of SoC architecture

## **Summary & Next Steps**

Dataflow architecture delivers 10Gbps line-rate performance and scalability to higher rates

Significantly higher RPS/Watt, with that lower TCO

> Minimal latency

> HLS reduces the complexity and shortens the development effort

#### > Next Steps:

- Address limitations
- Trials with real use cases

# **ALL PROGRAMMABLE**

Thank You. mblott@xilinx.com

© Copyright 2013 Xilinx

#### References

[1] ARVIND, AND NIKHIL, R. Executing a program on the mittagged-token dataflow architecture. Computers, IEEE Transactions on 39, 3 (March 1990), 300–318.

[2] ATIKOGLU, B., XU, Y., FRACHTENBERG, E., JIANG, S., AND PALECZNY, M. Workload analysis of a large-scale key-value store. SIGMETRICS Perform. Eval. Rev. 40, 1 (June 2012), 53–64.

[3] BANDO, M., ARTAN, N., AND CHAO, H. Flashlook: 100-gbps hash-tuned route lookup architecture. In High Performance Switching and Routing, 2009. HPSR 2009. International Conference on (June 2009), pp. 1–8.

[4] BEREZECKI, M., FRACHTENBERG, E., PALECZNY, M., AND STEELE, K. Power and performance evaluation of memcached

on the tilepro64 architecture. In Green Computing Conference and Workshops (IGCC), 2011 International (July 2011), pp. 1 –8.

[5] CONG, J., LIU, B., NEUENDORFFER, S., NOGUERA, J., VISSERS, K., AND ZHANG, Z. High-level synthesis for fpgas:

From prototyping to deployment. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 30, 4 (April 2011), 473 – 491.

[6] FERDMAN, M., ADILEH, A., KOCBERBER, O., VOLOS, S., ALISAFAEE, M., JEVDJIC, D., KAYNAK, C., POPESCU, A. D., AILAMAKI, A., AND FALSAFI, B. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. SIGARCH Comput. Archit. News 40, 1 (Mar. 2012), 37–48.

[7] HETHERINGTON, T. H., ROGERS, T. G., HSU, L., O'CONNOR, M., AND AAMODT, T. M. Characterizing and evaluating a key-value store application on heterogeneous cpu-gpu systems. In Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software (Washington, DC,

USA, 2012), ISPASS '12, IEEE Computer Society, pp. 88-98.

[8] ISTVAN, Z. Hash Table for Large Key-Value Stores on FPGAs. Master's thesis, ETH Zurich, Dept. of Computer Science, Systems Group, Switzerland, 2013.

[9] JOSE, J., SUBRAMONI, H., LUO, M., ZHANG, M., HUANG, J., UR RAHMAN, M. W., ISLAM, N. S., OUYANG, X., WANG, H., SUR, S., AND PANDA, D. K. Memcached design on high performance rdma capable interconnects. 2012 41st International Conference on Parallel Processing 0 (2011), 743–752.

[10] LANG, W., PATEL, J. M., AND SHANKAR, S. Wimpy node clusters: what about non-wimpy workloads? In DaMoN (2010), pp. 47–55.
[11] MATTSON, R., GECSEI, J., SLUTZ, D., AND TRAIGER, I. Evaluation techniques for storage hierarchies. IBM Systems Journal 9, 2 (1970), 78–117.

[12] WIGGINS, A., AND LANGSTON, J. Enhancing the scalability of memcached. In Intel Software Network (2012).

[13] Kevin Lim, David Meisner, Ali G. Saidi, Parthasarathy Ranganathan, and Thomas F. Wenisch. 2013. Thin servers with smart pipes: designing SoC accelerators for memcached. InProceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 36-47.

XII INX > ALL PROGRAMMABLE.