On an Intel Ivy Bridge test system with a Cisco Nexus SmartNIC K3P-S (formerly X25), the median application-to-network-to-application latency at 10G is 643 nanoseconds for small packets, when all of the latency optimization techniques available for the Cisco Nexus SmartNIC (formerly ExaNIC) are used. Latency increases with frame size in a way that is approximately ideal: if frames are sent and received in full, a 164 byte packet adds the time taken to transmit 100 extra bytes at 10G line rate to this figure. Results will vary with architecture; see below for the benchmarking setup, utilities and troubleshooting.
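As a rough worked example of the frame-size scaling described above, the time needed to put extra bytes on the wire is the number of bits divided by the line rate. The helper name `extra_wire_ns` is hypothetical, and the sketch ignores preamble, inter-frame gap and any host-side effects:

```shell
# Hypothetical helper: nanoseconds needed to serialize N extra bytes
# at a given line rate in Mbps (10000 for 10G).
# bits / (Mbps * 1e6) seconds == bits * 1000 / Mbps nanoseconds.
# Integer arithmetic, so the result is truncated to whole nanoseconds.
extra_wire_ns() {
    extra_bytes=$1
    rate_mbps=$2
    echo $(( extra_bytes * 8 * 1000 / rate_mbps ))
}

extra_wire_ns 100 10000   # 100 extra bytes at 10G: prints 80
```

Under these assumptions, a 164 byte frame would be expected to add roughly 80 ns to the small-packet figure.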

The following is a general guide to optimizing your system for performance benchmarks.

In general, when performing initial benchmarking investigations, set up with as little equipment as possible. We suggest a simple loopback cable for initial validation tests, without switches or other network equipment in series.

Cisco Nexus SmartNIC Configuration

The simplest starting point is to install a loopback cable between the first two ports of a SmartNIC device (e.g. exanic0:0 to exanic0:1). With the cable present, exanic-config should report SFP present and signal detected for both ports. If this is not the case, confirm that the SFPs are fully inserted and, if possible, replace the cable with a known good one.

The next step is to validate that the speeds of the two ports match. Run exanic-config again and confirm that the two Port speed: values are the same. For best benchmarking results this should be the highest available speed, e.g. 10000 Mbps. If the speeds differ or are too low, you can change them by running:

exanic-config exanic0:0 speed 10000

exanic-config exanic0:1 speed 10000

Next, confirm that the Port status: values are enabled. If a port is not enabled, bring it up with exanic-config, e.g. exanic-config exanic0:0 up. The final exanic-config output for both ports should look similar to:

Port 0:
 Interface: enp1s0
 Port speed: 10000 Mbps
 Port status: enabled, SFP present, signal detected, link active

and

Port 1:
 Interface: enp1s0d1
 Port speed: 10000 Mbps
 Port status: enabled, SFP present, signal detected, link active
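The two checks above (matching speeds, active links) can be scripted. The following is a minimal sketch that assumes exanic-config prints every port of the device in the format shown above; the helper name `check_ports` is hypothetical:

```shell
# Hypothetical helper: read exanic-config output on stdin and report
# whether all ports share one speed and every port has an active link.
# Assumes lines of the form " Port speed: 10000 Mbps" and
# " Port status: ..., link active" as in the sample output above.
check_ports() {
    awk '
        /Port speed:/  { speeds[$3] = 1 }
        /Port status:/ { ports++; if ($0 ~ /link active/) active++ }
        END {
            n = 0
            for (s in speeds) n++
            if (ports > 0 && n == 1 && active == ports) { print "OK"; exit 0 }
            print "MISMATCH"; exit 1
        }'
}

# Example (requires a SmartNIC): exanic-config exanic0 | check_ports
```

The function prints OK and returns 0 only when at least one port was seen, all speeds agree and every status line contains "link active".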

For simple loopback tests, also ensure that bypass-only mode is enabled: exanic-config exanic0:0 bypass-only on

Running benchmarking utilities

A number of benchmarking utilities, both for the SmartNIC and for other cards, are located in the perf-test directory provided with the distribution. To build these benchmark utilities for SmartNIC:

$ cd perf-test
$ make exanic

The exanic_perf_test application can be used to benchmark the performance of SmartNIC cards in a variety of configurations. This guide will demonstrate how to run a "loopback" benchmark, which measures the time taken for a frame to be sent and received by software with libexanic.

$ ./exanic_perf_test
exanic_perf_test: Measure the latency performance of ExaNICs with libexanic
Usage: ./exanic_perf_test -d device
         [-m testmode] [-t txport] [-r rxport]
         [-T txmode] [-R rxmode]
         [-s size] [-c count] [-w warmups] [-a]
  -m: specify the test mode (loopback/forward)
  -d: specify the exanic device name (e.g. exanic0)
  -t/-r set the port to transmit/receive packets on
  -T set the method to transmit packets (frame/preloaded)
  -R set the method to receive packets (frame/chunk_inplace)
  -s: specify the packet size to send (default 60)
  -c: specify how many packets to send (default 1000000)
  -w: specify how many warmup frames to send (default 100000)
  -a: print raw cycle counts instead of a percentile breakdown

To get an initial indication of your host's performance with libexanic, use exanic_perf_test to run a loopback benchmark between ports 0 and 1. Connect a cable from port 0 to port 1 and run exanic_perf_test as follows:

$ ./exanic_perf_test -d exanic0 -T frame -R frame
CPU GHz = 3.31
Percentile 0.000 = 625ns
Percentile 1.000 = 642ns
Percentile 5.000 = 679ns
Percentile 10.000 = 683ns
Percentile 25.000 = 689ns
Percentile 50.000 = 696ns
Percentile 75.000 = 709ns
Percentile 90.000 = 759ns
Percentile 95.000 = 770ns
Percentile 99.000 = 959ns
Percentile 100.000 = 1336ns

This configuration causes exanic_perf_test to loop back frames from port 0 to port 1, using exanic_transmit_frame() to send frames and exanic_receive_frame() to receive them. The libexanic documentation provides more information on what these function calls do.

It is possible to improve upon these values by making changes to the host's BIOS settings, kernel build configuration and kernel boot parameters. This guide also demonstrates the performance gains that are possible by taking advantage of latency saving techniques in libexanic.

BIOS Configuration

Turn off hyperthreading, SpeedStep, power saving, and any other energy saving settings that may be turned on. These can cause poor latency while the CPU ramps up. It is sometimes possible to identify whether this is happening by looking at /proc/cpuinfo. To do this, run:

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
stepping    : 9
microcode   : 0x1c
cpu MHz     : 1600.00

In the above example the CPU is running in a low power state at 1600 MHz, which is below its nominal speed of 3500 MHz.
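The comparison above can be automated across all cores. This is a minimal sketch; the helper name `slow_cpus` is hypothetical, and the nominal frequency has to be supplied by hand (3500 for the CPU in the example):

```shell
# Hypothetical helper: list processors whose current clock is below a
# given nominal frequency in MHz. Reads /proc/cpuinfo-style text on stdin.
slow_cpus() {
    awk -F: -v nominal="$1" '
        /^processor/ { cpu = $2 + 0 }
        /^cpu MHz/   { if (($2 + 0) < (nominal + 0)) print "cpu" cpu, ($2 + 0) " MHz" }
    '
}

# Example: slow_cpus 3500 < /proc/cpuinfo
```

An empty result means that no core reported a clock below the nominal value at the moment of sampling.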

The exanic_perf_test application will by default transmit warmup frames before the benchmark actually begins, to bring the CPU out of a power saving state for optimal results.

Kernel Build Configuration

Ensure that your kernel is built with CONFIG_NO_HZ_FULL=y. This setting allows you to run the kernel in fully tickless mode on your performance cores. Without it, timer ticks from the kernel interrupt your process and cause latency spikes. To check whether your kernel supports full tickless behaviour, examine the kernel config file, e.g.:

$ grep NO_HZ_FULL /boot/config-4.10.11-100.fc24.x86_64
CONFIG_NO_HZ_FULL=y
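The same check can be written against the running kernel rather than a hard-coded version. This sketch assumes the distribution installs the config as /boot/config-<release>, as in the example above; the helper name `nohz_full_enabled` is hypothetical:

```shell
# Hypothetical helper: succeed if the given kernel config file (default:
# the config of the running kernel, assumed to live in /boot) contains
# the exact line CONFIG_NO_HZ_FULL=y.
nohz_full_enabled() {
    grep -qx 'CONFIG_NO_HZ_FULL=y' "${1:-/boot/config-$(uname -r)}"
}

# Example: nohz_full_enabled && echo "full tickless supported"
```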

Kernel Boot Configuration

When testing software performance, ensure that the kernel boot configuration is set up for realtime performance. This is usually done by modifying /etc/default/grub. The following is an example:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 intel_idle.max_cstate=0 irqaffinity=0,1 selinux=0 audit=0 tsc=reliable"

isolcpus=2,3 causes the scheduler to remove CPUs 2 and 3 from the scheduling pool
nohz_full=2,3 causes CPUs 2 and 3 to run in fully tickless mode
rcu_nocbs=2,3 offloads RCU callback processing from CPUs 2 and 3
intel_idle.max_cstate=0 disables the intel_idle driver and falls back on acpi_idle
irqaffinity=0,1 sets the default IRQ mask to cores 0 and 1
selinux=0 disables the SELinux extensions
audit=0 disables the kernel auditing system
tsc=reliable marks the tsc clocksource as reliable, which disables clocksource verification at runtime

After regenerating the boot image and rebooting, you can check that these settings have taken effect by running the command below. You should see the parameters listed above in its output.

$ cat /proc/cmdline

See Linux kernel parameters documentation for more information.
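Checking the command line by eye is error prone, so the comparison can be scripted. The following is a minimal sketch; the helper name `missing_params` is hypothetical, and the parameter list passed to it is whatever you placed in /etc/default/grub:

```shell
# Hypothetical helper: print each expected parameter that is absent
# from a kernel command line. $1 is the command line text; the
# remaining arguments are the parameters to look for (exact match).
missing_params() {
    cmdline=" $1 "
    shift
    for p in "$@"; do
        case "$cmdline" in
            *" $p "*) ;;          # present as a whole word
            *) echo "$p" ;;       # not found
        esac
    done
}

# Example:
# missing_params "$(cat /proc/cmdline)" isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3
```

No output means every listed parameter was found.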

Hardware Configuration

Make sure the SmartNIC is plugged into a PCIe x8 Gen 3 slot and is running at 8.0 GT/s per lane (for systems that support PCIe Gen 3). This can be verified by running the lspci command and looking at the LnkSta (link status) output.

$ sudo lspci -d 1ce4:* -vvv | grep LnkSta:
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
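The link status line can also be checked mechanically. This sketch assumes the LnkSta format shown above (other lspci versions may add annotations after the values); the helper name `check_link` is hypothetical:

```shell
# Hypothetical helper: read lspci output on stdin and verify that every
# "LnkSta:" line reports an 8GT/s, x8 link.
check_link() {
    awk '
        /LnkSta:/ {
            found = 1
            if ($0 !~ /Speed 8GT\/s/ || $0 !~ /Width x8([^0-9]|$)/) bad = 1
        }
        END { if (found && !bad) { print "OK" } else { print "DEGRADED"; exit 1 } }'
}

# Example (requires a SmartNIC): sudo lspci -d 1ce4: -vvv | check_link
```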

Make sure the SmartNIC is plugged into a PCIe slot directly connected to a CPU. The server or motherboard documentation should indicate which slots are connected to CPUs and which are connected to the chipset. If unsure, the following procedure can be used. First determine the bus number of the SmartNIC from lspci:

$ sudo lspci -d 1ce4:*
02:00.0 Ethernet controller: Exablaze ExaNIC X25

In this case, the bus number is 02. Now search for the device that has secondary=02 in the output of lspci -v, for example:

$ sudo lspci -v
...
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
...

For optimal performance, this should be a processor root port (in this case, “Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port”).
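The search for the upstream bridge can be scripted rather than done by eye. This is a minimal sketch that assumes the usual lspci -v layout, with each device starting on an unindented line beginning with its bus address; the helper name `parent_bridge` is hypothetical:

```shell
# Hypothetical helper: read `lspci -v` output on stdin and print the
# header line of the bridge whose secondary bus equals $1 (e.g. 02).
parent_bridge() {
    awk -v bus="$1" '
        # Device header, with or without a PCI domain prefix.
        /^([0-9a-f]+:)?[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / { header = $0 }
        index($0, "secondary=" bus ",") { print header }
    '
}

# Example: sudo lspci -v | parent_bridge 02
```

If the printed line does not describe a processor root port, the slot is probably connected through the chipset.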

Software Settings

There are a number of configuration options for Linux that will improve realtime performance.

We recommend the following:

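cpus="2 3"

# Disable realtime throttling, the watchdog and the NMI watchdog
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo 0 > /proc/sys/kernel/watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
# Default IRQ affinity mask 0b11: CPUs 0 and 1 only
echo 3 > /proc/irq/default_smp_affinity

# Move every existing IRQ to CPU 0 (mask 0x1), then print the result.
# Iterating over directories skips the default_smp_affinity file.
for irq in /proc/irq/*/; do echo 1 > "${irq}smp_affinity"; done
for irq in /proc/irq/*/; do printf '%s  ' "$(basename "$irq")"; cat "${irq}smp_affinity_list"; done

# Per-core settings for the isolated CPUs
for cpu in $cpus
do
    echo "performance" > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
    echo 0 > /sys/devices/system/machinecheck/machinecheck$cpu/check_interval
done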

The above code:

disables Linux realtime throttling, which normally ensures that realtime processes cannot starve the CPUs.
disables the Linux watchdog timer, which is used to detect and recover from software faults.
disables the NMI watchdog, a debugging feature for catching hardware hangs.
sets the default IRQ affinity mask to 0b11 (3), which means that only CPUs 0 and 1 handle interrupts by default.
moves all existing interrupts to CPU 0, and therefore off CPUs 2 and 3.
sets the CPU frequency governor to performance and disables periodic machine check polling on CPUs 2 and 3.

For more information, see Improving Linux Realtime Properties.
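The affinity values written above are hexadecimal bitmasks with one bit per CPU. As an illustration of how a list of CPUs maps to such a mask, here is a small sketch; the helper name `cpu_mask` is hypothetical:

```shell
# Hypothetical helper: convert a list of CPU numbers into the hex
# bitmask format used by /proc/irq/*/smp_affinity (bit N = CPU N).
# Shell arithmetic limits this sketch to CPU numbers below 63.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 0 1   # CPUs 0 and 1: prints 3
cpu_mask 0     # CPU 0 only:   prints 1
```

This is why writing 3 to default_smp_affinity selects CPUs 0 and 1, while writing 1 to an individual smp_affinity file selects CPU 0 only.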

Improving performance with libexanic (raw frames)

When running a benchmark, pin the process to one of the isolated cores as follows:

$ sudo taskset -c 2 ./exanic_perf_test -d exanic0 -T frame -R frame

It is possible to obtain better results by using faster methods for transmit and receive. To do so, run exanic_perf_test with the following options:

$ sudo taskset -c 2 ./exanic_perf_test -d exanic0 -T preloaded -R chunk_inplace
CPU GHz = 3.31
Percentile 0.000 = 568ns
Percentile 1.000 = 597ns
Percentile 5.000 = 603ns
Percentile 10.000 = 607ns
Percentile 25.000 = 617ns
Percentile 50.000 = 629ns
Percentile 75.000 = 637ns
Percentile 90.000 = 642ns
Percentile 95.000 = 671ns
Percentile 99.000 = 750ns
Percentile 100.000 = 1209ns

This causes exanic_perf_test to use transmit preloading and in-place chunked receive. Of the configurations shown in this guide, running exanic_perf_test in this manner gives the best results.
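When comparing several runs it can help to pull a single percentile out of the output. This is a minimal sketch that assumes the "Percentile P = Nns" format shown above; the helper name `percentile_ns` is hypothetical:

```shell
# Hypothetical helper: read exanic_perf_test output on stdin and print
# the value, in nanoseconds, of the requested percentile (e.g. 50).
percentile_ns() {
    awk -v p="$1" '
        $1 == "Percentile" && ($2 + 0) == (p + 0) {
            sub(/ns$/, "", $4)   # strip the unit from e.g. "629ns"
            print $4
        }'
}

# Example (requires a SmartNIC):
# sudo taskset -c 2 ./exanic_perf_test -d exanic0 -T preloaded -R chunk_inplace | percentile_ns 50
```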

Benchmarking latency with exasock (UDP/TCP)

We use sockperf for testing because it is open source and well understood. Before testing UDP/TCP, please ensure that raw frames are working correctly (as described above).

Start by downloading the sockperf source from the GitHub repository, and then build the application by running:

$ ./autogen.sh
$ ./configure --prefix=
$ make
$ make install

Ensure that bypass-only and local loopback are disabled:

$ exanic-config exanic0:0 bypass-only off
$ exanic-config exanic0:0 local-loopback off

Set up a second machine, also fitted with a SmartNIC, using the configuration options from the steps above, and connect port 0 of each SmartNIC together using a short fiber or direct attach cable.

Then, set up IP addresses on both hosts with:

$ ifconfig <interface> <ip-address> netmask <mask>

Run accelerated TCP/UDP sockperf on client and server:

server# exasock taskset -c 2 sockperf sr -i 10.10.0.2
client# exasock taskset -c 2 sockperf pp -i 10.10.0.2 -t5 -m 14
sockperf: == version #3.1-16.gitc6a0d0e3ab53 ==
sockperf: [Total Run] RunTime=5.450 sec; SentMessages=2882949;
sockperf: ====> avg-lat=  0.805 (std-dev=0.034)
sockperf: Summary: Latency is 0.805 usec
sockperf: ---> <MAX> observation =    3.062
sockperf: ---> percentile 99.999 =    1.174
sockperf: ---> percentile 99.900 =    1.002
sockperf: ---> percentile 99.000 =    0.931
sockperf: ---> percentile 90.000 =    0.838
sockperf: ---> percentile 75.000 =    0.817
sockperf: ---> percentile 50.000 =    0.802
sockperf: ---> percentile 25.000 =    0.784
sockperf: ---> <MIN> observation =    0.736

RESULT: ExaSock UDP 1/2 RTT latency < 810 ns.

This page was last updated on Dec-10-2020.