# Understanding Roadblocks in Virtual Network I/O: A Comprehensive Analysis of CPU Cache Usage

D. Takeya<sup>†</sup>, <u>R. Kawashima</u><sup>†</sup>, Y. Nakayama<sup>‡</sup>, T. Hayashi<sup>‡</sup>, H. Matsuo

Contact: *R. Kawashima* <kawa1983@ieee.org>

<sup>†</sup>Nagoya Institute of Technology <sup>‡</sup>BOSCO Technologies, Inc.

# POINTS



### Identify the performance bottleneck



### Focus on CPU L1 cache usage

Past focus: packet copy



### Show a potential to achieve 100+ Mpps

6x higher than DPDK/vhost-user





# AGENDA

2. CACHE & VIRTUAL NETWORK I/O: Why CPU cache?, CPU cache usage
3. OUR STUDY: Goals, Approach, Evaluation design
4. RESULTS: Environment, Throughput, Analysis
5. CONCLUSION: Conclusion and future work

### CNF (Cloud-native Network Function)

#### Container-formed network functions



### Virtual Network I/O

#### Performance-critical part of CNFs

*Figure: an NFV node in which a containerized CNF (application-dependent packet I/O logic on a DPDK driver) exchanges MBuf packets with a DPDK virtual switch on the host through virtual network I/O (inter-process communication); the virtual switch connects to the NIC ports.*

- Vhost-user is the de facto virtual network I/O
- Bottleneck: 15-20 Mpps

Why does virtual network I/O halve throughput?
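To put these packet rates in perspective, a per-packet cycle budget can be estimated by dividing the CPU clock by the packet rate. The sketch below is only an illustration: the 3.0 GHz clock is an assumed value that does not come from the slides, and `cycles_per_packet` is a hypothetical helper name.

```python
def cycles_per_packet(clock_hz: float, rate_pps: float) -> float:
    """CPU cycles available per packet on one core at a given packet rate."""
    return clock_hz / rate_pps


CLOCK_HZ = 3.0e9  # assumed clock frequency (illustrative, not from the slides)

# 15-20 Mpps is the vhost-user range quoted above; 100 Mpps is the stated potential.
for mpps in (15, 20, 100):
    budget = cycles_per_packet(CLOCK_HZ, mpps * 1e6)
    print(f"{mpps:>3} Mpps -> {budget:.0f} cycles per packet")
```

Under this assumption, 100 Mpps leaves roughly 30 cycles per packet, which is less than what a single access served from main memory would typically cost.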

### Zero-copy

#### Past studies focused on packet copy



- Packet (memory) copy is removed
  - Various implementations
    - NetVM (2014), OpenNetVM (2016)
    - ZCopy-Vhost (2017)
    - IOVTee (2018)
- Marginal effect on performance

Throughput (64B):

- With packet copy: ≈15 Gbps
- Gain from zero-copy: 20-40%

Isn't packet copy the true bottleneck?

# 2. CACHE & VIRTUAL NETWORK I/O

### Why CPU Cache?

#### Every little bit adds up



- Cache is always accessed
  - Virtual network I/O: due to packet copy, or queue handling?
- Penalty of cache misses

#### Performance cost

Cache miss  $\Rightarrow$  Packet copy (64B)

Why does virtual network I/O need frequent cache accesses?

### CPU Cache Usage (in Virtual Network I/O)

#### Three-body problem in cache/memory



# 3. OUR STUDY



### Understand the true bottleneck in virtual network I/O

### Unveil the effect of cache usage on performance

### Assess the possibility of a fair speed-up

### APPROACH

#### Exhaustive experiments and analyses



### Evaluation Design

#### Inheritance and Multiplexing



### EIVU (Essential Implementation of Vhost-User)

#### Easy-to-customize evaluation framework






#### Equivalent design/implementation and performance

# 4. RESULTS

### Contents & Environment

#### To what extent does cache usage affect throughput?



### Throughput vs. L1 Cache Usage



What was the performance bottleneck?

### Analysis

#### Look deep inside the best-case item!





#### Why is factor (c) so influential?

### Performance Bottleneck

#### The buffer header causes implicit conflicts

*Figure: an NFV node with a containerized CNF (simple L2 forwarding) above a vSwitch on the host, connected to the NIC ports. Labels: H/W prefetching, invalidation, packet copies (memory accesses).*

Future challenge: Re-design of packet buffer structure

# 5. CONCLUSION

### Conclusion and Future Work

#### Theme: Performance issue of virtual network I/O

| Then                                         | Now                               |
|----------------------------------------------|-----------------------------------|
| Throughput: 15-20 Mpps                       | Throughput: 100+ Mpps (potential) |
| Focus: Packet copy (not the true bottleneck) | Focus: CPU cache usage            |
| Approach: Zero-copy                          | Approach: Re-design (structure)   |

- An L1 hit ratio of over 99.99% is necessary
- Implicit cache conflicts need to be avoided

#### Challenge: Re-design of packet buffer structure

# RESOURCES

#### EIVU platform

https://github.com/sdnnitech/EIVU

#### Evaluation design

https://sdnnitech.github.io/EIVU/eval/evaluation.html

#### Result

https://sdnnitech.github.io/EIVU/eval/results.html

#### Mathematical analysis

https://github.com/sdnnitech/CESim

### [Appendix] Results on the Other Servers



### [Appendix] Impact of Cache Invalidations



The major cause of L1 cache misses is invalidation!

### [Appendix] Effects of L2 and L3 Caches



L2/L3 cache usage has little impact on throughput!

### [Appendix] Tipping Points

### Why do the tipping points appear?

### Experiment

- See 4. RESULTS (slides pp. 15-18)
- Acquired values are useful
- The real environment is too complex to dig into

### Modeling

- Essential nature of packet forwarding
- Experimental results are fed back into the model

### Simulation

- Throughput vs. Cache usage
- Can it reproduce the experimental results?

### [Appendix] Modeling (Parameters)

#### Experiment (best-case item)

- Throughput
- Cache usage
- No. of cache accesses (per packet)

#### Machine spec.

- CPU clock
- Access latency

#### Constants

- Input parameters
- Pure proc. ratio (α)
- Acceleration factor (β)

#### Variable

- Cache hit ratio (L1)

These parameters are the inputs to the modeling step.
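As a rough illustration of how parameters of this kind could be combined, the sketch below derives throughput from the CPU clock, the number of cache accesses per packet, the access latencies, and the L1 hit ratio, using a plain average-access-time formula in which every access is serialized. This is not the authors' model: the slides do not give its equations, α and β are left out because their exact role is not stated, and all names and numeric values here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParams:
    clock_hz: float        # CPU clock
    accesses_per_pkt: int  # no. of cache accesses per packet
    hit_latency: float     # latency of an L1 hit, in cycles
    miss_latency: float    # latency of an L1 miss, in cycles


def throughput_pps(p: ModelParams, l1_hit_ratio: float) -> float:
    """Packets per second when every cache access is serialized (no overlap)."""
    if not 0.0 <= l1_hit_ratio <= 1.0:
        raise ValueError("hit ratio must be in [0, 1]")
    # Average cycles per access, weighted by the L1 hit ratio.
    avg = l1_hit_ratio * p.hit_latency + (1.0 - l1_hit_ratio) * p.miss_latency
    return p.clock_hz / (p.accesses_per_pkt * avg)


# Illustrative values only; none of them are taken from the experiments.
params = ModelParams(clock_hz=3.0e9, accesses_per_pkt=10,
                     hit_latency=4.0, miss_latency=100.0)

for h in (1.0, 0.9999, 0.99):
    print(f"L1 hit ratio {h:.4f}: {throughput_pps(params, h) / 1e6:.1f} Mpps")
```

The L1 hit ratio is kept as a function argument rather than a field because it is the one variable swept against throughput, while the other values stay fixed for a given machine and workload.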

### [Appendix] Modeling (Construction)

#### Simple model (Non-parallelized)



#### Calculate throughput!

#### Modified model (Parallelized)

The tipping point doesn't appear in the simple model (see [Appendix] Simulation).



### [Appendix] Simulation



L1 cache misses would cancel the parallelization effect!
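One way to see why misses could cancel a parallelization gain is a toy calculation in which L1 hits overlap with each other while misses stall and are fully serialized. The sketch below rests entirely on that assumption; it is not the CESim model, and the function name, the `overlap` factor, and the latency values are hypothetical.

```python
def speedup_from_overlap(hit_ratio: float, overlap: float,
                         hit_latency: float = 4.0,
                         miss_latency: float = 100.0) -> float:
    """Speed-up of the overlapped case over the fully serialized case.

    Toy assumption: L1 hits overlap `overlap`-fold, misses do not overlap at all.
    """
    # Average cycles per access when nothing overlaps.
    serial = hit_ratio * hit_latency + (1.0 - hit_ratio) * miss_latency
    # Same average when only the hit portion is shortened by the overlap factor.
    overlapped = (hit_ratio * hit_latency / overlap
                  + (1.0 - hit_ratio) * miss_latency)
    return serial / overlapped


for h in (1.0, 0.9999, 0.999, 0.99, 0.9):
    gain = speedup_from_overlap(h, overlap=4.0)
    print(f"L1 hit ratio {h:<6}: speed-up x{gain:.2f}")
```

With these illustrative numbers the full gain is reached only at a hit ratio of 1.0 and shrinks as the hit ratio falls, because the serialized miss latency comes to dominate the per-access cost.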