On an Intel Ivy Bridge test system with a Cisco Nexus SmartNIC K3P-S (formerly X25), the median application-to-network-to-application latency at 10G is 643 nanoseconds for small packets when all of the latency optimization techniques available for the Cisco Nexus SmartNIC (formerly ExaNIC) are used. Latency increases with frame size in an approximately ideal way: if frames are sent and received in full, a 164 byte packet adds the time needed to transfer 100 extra bytes at 10G line rate to this figure. Results will vary with architecture; see below for the benchmarking setup, utilities and troubleshooting.
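As a rough worked example of the frame-size scaling described above, the sketch below adds the serialization time of the extra bytes to the small-packet figure. The 64 byte baseline is an assumption inferred from the "164 byte packet adds 100 bytes" statement, and the function names are illustrative only.

```python
def serialization_delay_ns(extra_bytes: int, line_rate_gbps: float = 10.0) -> float:
    """Time to clock extra_bytes onto the wire: bits / (Gbit/s) gives nanoseconds."""
    return extra_bytes * 8 / line_rate_gbps


def expected_latency_ns(frame_bytes: int,
                        base_latency_ns: float = 643.0,   # small-packet median quoted above
                        base_frame_bytes: int = 64,       # assumed small-packet size
                        line_rate_gbps: float = 10.0) -> float:
    """Estimate latency for a larger frame, assuming ideal line-rate scaling."""
    extra = max(frame_bytes - base_frame_bytes, 0)
    return base_latency_ns + serialization_delay_ns(extra, line_rate_gbps)


if __name__ == "__main__":
    print(serialization_delay_ns(100))   # 100 extra bytes at 10G
    print(expected_latency_ns(164))      # estimate for a 164 byte frame
```

Under these assumptions, 100 extra bytes cost 80 ns at 10G, so a 164 byte frame would be expected at roughly 723 ns on the same system.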
The following is a general guide to optimizing your system for performance benchmarks.
For initial benchmarking investigations, we advise setting up with as little equipment as possible. We suggest a simple looped-back cable for initial validation tests, with no switches or other network equipment in series.
Cisco Nexus SmartNIC Configuration
The simplest starting point is to install a loopback cable between the first two ports of a SmartNIC device (e.g. exanic0:0 to exanic0:1).
With the cable present, exanic-config should report SFP present and signal detected for the two ports. If this is not the case, confirm that the SFPs are completely inserted and, if possible, replace the cable with a known good one.
The next step is to validate that the speeds of the two ports match. Run exanic-config again and confirm that the two Port speed: values match. For best benchmarking results this should be the highest available speed, e.g. 10000 Mbps. If the speeds differ or are slow, you can change them by running:
exanic-config exanic0:0 speed 10000
exanic-config exanic0:1 speed 10000
Next, confirm that the Port status: values are enabled. If they are not, run the exanic-config exanic0:0 up command (and the equivalent for exanic0:1).
The final exanic-config output for both ports should look similar to:
Port 0:
  Interface: enp1s0
  Port speed: 10000 Mbps
  Port status: enabled, SFP present, signal detected, link active
Port 1:
  Interface: enp1s0d1
  Port speed: 10000 Mbps
  Port status: enabled, SFP present, signal detected, link active
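The checks above (matching speeds, both ports enabled with SFP, signal and link) can be automated. The following is a minimal sketch that parses text in the format of the sample output, assuming one field per line; the function names are illustrative and the parser only understands the fields shown in this guide.

```python
import re

REQUIRED_FLAGS = {"enabled", "SFP present", "signal detected", "link active"}

SAMPLE = """\
Port 0:
  Interface: enp1s0
  Port speed: 10000 Mbps
  Port status: enabled, SFP present, signal detected, link active
Port 1:
  Interface: enp1s0d1
  Port speed: 10000 Mbps
  Port status: enabled, SFP present, signal detected, link active
"""


def parse_ports(text):
    """Return {port_number: {"speed_mbps": int or None, "status": set of flags}}."""
    ports, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Port (\d+):$", line)
        if header:
            current = ports.setdefault(int(header.group(1)),
                                       {"speed_mbps": None, "status": set()})
        elif current is None:
            continue  # ignore anything before the first port header
        elif line.startswith("Port speed:"):
            current["speed_mbps"] = int(line.split(":", 1)[1].split()[0])
        elif line.startswith("Port status:"):
            current["status"] = {f.strip() for f in line.split(":", 1)[1].split(",")}
    return ports


def loopback_ready(ports, a=0, b=1):
    """True if both ports exist, run at the same speed and report all required flags."""
    pa, pb = ports.get(a), ports.get(b)
    if not pa or not pb or pa["speed_mbps"] is None:
        return False
    return (pa["speed_mbps"] == pb["speed_mbps"]
            and REQUIRED_FLAGS <= pa["status"]
            and REQUIRED_FLAGS <= pb["status"])


if __name__ == "__main__":
    print(loopback_ready(parse_ports(SAMPLE)))
```

In practice the text would come from running exanic-config and capturing its output, for example with subprocess.run.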
For simple loopback tests, also ensure that bypass-only mode is enabled:
exanic-config exanic0:0 bypass-only on
Running benchmarking utilities
A number of benchmarking utilities, both for the SmartNIC and for other cards, are located in the perf-test directory provided with the distribution. To build these benchmark utilities for SmartNIC:
$ cd perf-test
$ make exanic
The exanic_perf_test application can be used to benchmark the performance of SmartNIC cards in a variety of configurations. This guide demonstrates how to run a "loopback" benchmark, which measures the time taken for a frame to be sent and received by software using libexanic.
$ ./exanic_perf_test
exanic_perf_test: Measure the latency performance of ExaNICs with libexanic
Usage: ./exanic_perf_test -d device [-m testmode] [-t txport] [-r rxport] [-T txmode] [-R rxmode] [-s size] [-c count] [-w warmups] [-a]
  -m: specify the test mode (loopback/forward)
  -d: specify the exanic device name (e.g. exanic0)
  -t/-r set the port to transmit/receive packets on
  -T set the method to transmit packets (frame/preloaded)
  -R set the method to receive packets (frame/chunk_inplace)
  -s: specify the packet size to send (default 60)
  -c: specify how many packets to send (default 1000000)
  -w: specify how many warmup frames to send (default 100000)
  -a: print raw cycle counts instead of a percentile breakdown
To get an initial indication of your host's performance with libexanic, use exanic_perf_test to run a loopback benchmark between ports 0 and 1. Connect a cable from port 0 to port 1 and run exanic_perf_test as follows:
$ ./exanic_perf_test -d exanic0 -T frame -R frame
CPU GHz = 3.31
Percentile 0.000 = 625ns
Percentile 1.000 = 642ns
Percentile 5.000 = 679ns
Percentile 10.000 = 683ns
Percentile 25.000 = 689ns
Percentile 50.000 = 696ns
Percentile 75.000 = 709ns
Percentile 90.000 = 759ns
Percentile 95.000 = 770ns
Percentile 99.000 = 959ns
Percentile 100.000 = 1336ns
This configuration causes exanic_perf_test to loop frames back from port 0 to port 1, using exanic_transmit_frame() to send frames and exanic_receive_frame() to receive them. The libexanic documentation describes these function calls in more detail.
It is possible to improve upon these values by changing the host's BIOS settings, kernel build configuration and kernel boot parameters. This guide also demonstrates the performance gains that are possible by taking advantage of latency-saving techniques in libexanic.
Turn off hyperthreading, SpeedStep, power saving, and any other energy-saving settings that may be turned on. These can cause poor latency while the CPU ramps up. It is sometimes possible to identify whether this is happening by looking at /proc/cpuinfo. To do this, run:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
stepping : 9
microcode : 0x1c
cpu MHz : 1600.00
In the above example the CPU is running in a low power state at 1600 MHz, which is below its nominal speed of 3500 MHz.
The exanic_perf_test application transmits warmup frames by default before the benchmark begins, to bring the CPU out of any power-saving state for optimal results.
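The cpuinfo comparison described above can be scripted. This is a minimal sketch that compares the reported "cpu MHz" of each processor with the nominal frequency embedded in its model name; it assumes the model name ends in an "@ x.xxGHz" suffix, which is not the case for every CPU, and the 90% tolerance is an arbitrary illustrative threshold.

```python
import re

SAMPLE_CPUINFO = """\
processor : 0
vendor_id : GenuineIntel
model name : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
cpu MHz : 1600.00
"""


def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per processor (blank-line separated)."""
    cpus, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                cpus.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def throttled_cpus(text, tolerance=0.9):
    """Return (processor, current_mhz, nominal_mhz) for CPUs below tolerance * nominal."""
    result = []
    for cpu in parse_cpuinfo(text):
        nominal = re.search(r"@\s*([\d.]+)\s*GHz", cpu.get("model name", ""))
        if not nominal or "cpu MHz" not in cpu:
            continue  # nominal speed unknown; skip rather than guess
        nominal_mhz = float(nominal.group(1)) * 1000
        current_mhz = float(cpu["cpu MHz"])
        if current_mhz < tolerance * nominal_mhz:
            result.append((int(cpu["processor"]), current_mhz, nominal_mhz))
    return result


if __name__ == "__main__":
    for proc, cur, nom in throttled_cpus(SAMPLE_CPUINFO):
        print(f"CPU {proc}: {cur:.0f} MHz (nominal {nom:.0f} MHz)")
```

To check a live system, pass the contents of /proc/cpuinfo instead of the sample string. A reading taken while the system is idle may legitimately be low, so it is most meaningful while a benchmark is running.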
Kernel Build Configuration
Ensure that your kernel is built with CONFIG_NO_HZ_FULL=y. This setting allows you to run the kernel in fully tickless mode on your performance cores. Without it, timer ticks from the kernel interrupt your process and cause latency spikes. To check whether your kernel supports full tickless behaviour, examine the kernel config file, e.g.:
$ cat /boot/config-4.10.11-100.fc24.x86_64 | grep NO_HZ_FULL
CONFIG_NO_HZ_FULL=y
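The same check can be done programmatically. The sketch below looks up an option in kernel config text; the helper name is illustrative, and the extra lines in the sample (other than CONFIG_NO_HZ_FULL=y) are made up to show how unset options appear.

```python
SAMPLE_CONFIG = """\
CONFIG_NO_HZ_COMMON=y
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
"""


def kernel_option(config_text, name):
    """Return the value of a kernel config option ('y', 'm', ...) or None if it is unset."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
    return None


if __name__ == "__main__":
    print(kernel_option(SAMPLE_CONFIG, "CONFIG_NO_HZ_FULL"))
```

On many distributions the config for the running kernel is at /boot/config-<release>, where the release string is what platform.release() returns; this location is a common convention rather than a guarantee.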
Kernel Boot Configuration
When testing software performance, ensure that the kernel boot parameters are set for realtime performance. This is usually done by modifying /etc/default/grub. The following is an example:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 intel_idle.max_cstate=0 irqaffinity=0,1 selinux=0 audit=0 tsc=reliable"
- isolcpus=2,3 causes the scheduler to remove CPUs 2 and 3 from the scheduling pool
- nohz_full=2,3 causes CPUs 2 and 3 to run in fully tickless mode
- rcu_nocbs=2,3 offloads RCU callbacks from these cores
- intel_idle.max_cstate=0 disables the intel_idle driver and falls back to acpi_idle
- irqaffinity=0,1 sets the default IRQ mask to cores 0 and 1
- selinux=0 disables the SELinux extensions
- audit=0 disables the kernel auditing system
- tsc=reliable marks the TSC clocksource as reliable, which disables clocksource verification at runtime
After regenerating the boot image and rebooting, you can check that these parameters have taken effect by running the command below. The output should include the parameters above.
$ cat /proc/cmdline
See the Linux kernel parameters documentation for more information.
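To avoid checking /proc/cmdline by eye, a small script can compare it with the expected parameters. This is a sketch only: the EXPECTED table mirrors the latency-related parameters from the example GRUB line above, and the BOOT_IMAGE and root values in the sample are placeholders.

```python
EXPECTED = {
    "isolcpus": "2,3",
    "nohz_full": "2,3",
    "rcu_nocbs": "2,3",
    "intel_idle.max_cstate": "0",
    "irqaffinity": "0,1",
    "selinux": "0",
    "audit": "0",
    "tsc": "reliable",
}

# Placeholder command line in the shape of /proc/cmdline.
SAMPLE_CMDLINE = (
    "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet splash isolcpus=2,3 nohz_full=2,3 "
    "rcu_nocbs=2,3 intel_idle.max_cstate=0 irqaffinity=0,1 selinux=0 audit=0 tsc=reliable"
)


def parse_cmdline(cmdline):
    """Map each boot parameter to its value (None for bare flags such as 'quiet')."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline, expected=EXPECTED):
    """Return the sorted names of expected parameters that are absent or set differently."""
    actual = parse_cmdline(cmdline)
    return sorted(name for name, value in expected.items() if actual.get(name) != value)


if __name__ == "__main__":
    print(missing_params(SAMPLE_CMDLINE))
```

On a live system, read the text from /proc/cmdline and adjust EXPECTED to match the cores you isolated.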
Make sure the SmartNIC is plugged into a PCIe x8 Gen 3 slot and is running at 8.0 GT/s per lane (for systems that support PCIe Gen 3). This can be identified by running the lspci command and looking at the LnkSta (link status) output.
$ sudo lspci -d 1ce4:* -vvv | grep LnkSta:
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
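The LnkSta line can also be checked in a script. The sketch below extracts the negotiated speed and width from a line in the format shown above; the helper names are illustrative, and the regular expression also tolerates the "(ok)" or "(downgraded)" annotations that newer lspci versions append.

```python
import re

SAMPLE_LNKSTA = ("LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ "
                 "DLActive- BWMgmt- ABWMgmt-")


def parse_lnksta(line):
    """Return (speed_gt_per_s, lane_width) from an lspci LnkSta line, or None."""
    match = re.search(r"LnkSta:\s*Speed\s+([\d.]+)\s*GT/s.*?Width\s+x(\d+)", line)
    if not match:
        return None
    return float(match.group(1)), int(match.group(2))


def link_is_gen3_x8(line):
    """True if the link negotiated at least 8 GT/s on at least 8 lanes."""
    parsed = parse_lnksta(line)
    return parsed is not None and parsed[0] >= 8.0 and parsed[1] >= 8


if __name__ == "__main__":
    print(parse_lnksta(SAMPLE_LNKSTA), link_is_gen3_x8(SAMPLE_LNKSTA))
```

A result below 8 GT/s or x8 usually points to a slower or narrower slot than the card supports.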
Make sure the SmartNIC is plugged into a PCIe slot directly connected to a CPU. The server or motherboard documentation should indicate which slots are connected to CPUs and which are connected to the chipset. If unsure, the following procedure can be used. First determine the bus number of the SmartNIC from lspci:
$ sudo lspci -d 1ce4:*
02:00.0 Ethernet controller: Exablaze ExaNIC X25
In this case, the bus number is 02. Now search for the device that has secondary=02 in the output of lspci -v, for example:
$ sudo lspci -v
...
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
...
For optimal performance, this should be a processor root port (in this case, “Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port”).
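The manual search through lspci -v output can be sketched as follows. The code relies on the layout shown above, where device header lines start in the first column and detail lines are indented. Matching "Root Port" in the description is only a heuristic and should be confirmed against the motherboard documentation.

```python
import re

SAMPLE_LSPCI_V = """\
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09) (prog-if 00 [Normal decode])
\tFlags: bus master, fast devsel, latency 0
\tBus: primary=00, secondary=02, subordinate=02, sec-latency=0

02:00.0 Ethernet controller: Exablaze ExaNIC X25
\tFlags: bus master, fast devsel, latency 0
"""


def find_upstream_bridge(lspci_v_text, bus):
    """Return the header line of the bridge whose secondary bus is `bus`, or None."""
    header = None
    pattern = re.compile(rf"\bsecondary={re.escape(bus)}\b")
    for line in lspci_v_text.splitlines():
        if line and not line[0].isspace():
            header = line              # a new device entry starts at column 0
        elif header and pattern.search(line):
            return header
    return None


def looks_like_cpu_root_port(header):
    """Heuristic: the bridge description mentions a processor root port."""
    return header is not None and "Root Port" in header


if __name__ == "__main__":
    bridge = find_upstream_bridge(SAMPLE_LSPCI_V, "02")
    print(bridge)
    print(looks_like_cpu_root_port(bridge))
```

The bus argument is the two-digit bus number taken from the SmartNIC's own lspci entry (02 in this example).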
There are a number of configuration options for Linux that will improve realtime performance.
We recommend the following:
cpus="2 3"
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo 0 > /proc/sys/kernel/watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 3 > /proc/irq/default_smp_affinity
for irq in `ls /proc/irq/`; do echo 1 > /proc/irq/$irq/smp_affinity; done
for irq in `ls /proc/irq/`; do echo -n "$irq "; cat /proc/irq/$irq/smp_affinity_list; done
for cpu in $cpus
do
    echo "performance" > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
    echo 0 > /sys/devices/system/machinecheck/machinecheck$cpu/check_interval
done
The above code:
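- disables Linux realtime throttling, the mechanism that normally stops realtime processes from starving other tasks on a CPU.
- disables the Linux watchdog timer, which is used to detect and recover from software faults.
- disables the NMI watchdog, a debugging feature for catching hardware hangs.
- sets the default IRQ affinity to 0b11 (3), which means that only CPUs 0 and 1 handle interrupts.
- moves all existing interrupts to CPU 0 (mask 1), and therefore off CPUs 2 and 3, then prints the resulting affinity of each IRQ.
- sets the frequency governor of CPUs 2 and 3 to performance and disables periodic machine check polling on them.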
For more information, see Improving Linux Realtime Properties.
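The affinity values written above (3 for CPUs 0 and 1, 1 for CPU 0 alone) are hexadecimal bitmasks with one bit per CPU. The sketch below converts between CPU lists and masks so that the values can be adapted to a different core layout; the function names are illustrative.

```python
def cpus_to_mask(cpus):
    """Return the hexadecimal affinity mask (as written to smp_affinity) for a list of CPUs."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu          # bit N set means CPU N may handle the interrupt
    return format(mask, "x")


def mask_to_cpus(mask_hex):
    """Return the CPU numbers selected by a hexadecimal mask such as '3' or '00000000,0000000c'."""
    mask = int(mask_hex.replace(",", ""), 16)   # the kernel separates 32-bit groups with commas
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    print(cpus_to_mask([0, 1]))   # mask used for default_smp_affinity above
    print(mask_to_cpus("1"))      # CPUs selected by the per-IRQ mask above
```

For example, CPUs 0 and 1 give the mask 3, while the isolated CPUs 2 and 3 would correspond to the mask c, which is the value to avoid when assigning interrupts.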
Improving performance with libexanic (raw frames)
When running a benchmark, pin the process to one of the isolated cores as follows:
$ sudo taskset -c 2 ./exanic_perf_test -d exanic0 -T frame -R frame
It is possible to obtain better results by using faster methods for transmit and receive. To do this, run exanic_perf_test with the following options:
$ sudo taskset -c 2 ./exanic_perf_test -d exanic0 -T preloaded -R chunk_inplace
CPU GHz = 3.31
Percentile 0.000 = 568ns
Percentile 1.000 = 597ns
Percentile 5.000 = 603ns
Percentile 10.000 = 607ns
Percentile 25.000 = 617ns
Percentile 50.000 = 629ns
Percentile 75.000 = 637ns
Percentile 90.000 = 642ns
Percentile 95.000 = 671ns
Percentile 99.000 = 750ns
Percentile 100.000 = 1209ns
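To compare two runs, the percentile lines printed by exanic_perf_test can be parsed and subtracted. The sketch below uses the median and 99th percentile values from the two runs shown in this guide; the function names are illustrative. The first run in this guide was also made before the system tuning steps, so the difference reflects both the tuning and the faster transmit and receive methods.

```python
import re

# Values copied from the two runs shown in this guide.
FRAME_RUN = "Percentile 50.000 = 696ns Percentile 99.000 = 959ns"
PRELOADED_RUN = "Percentile 50.000 = 629ns Percentile 99.000 = 750ns"


def parse_percentiles(output):
    """Return {percentile: latency_ns} from exanic_perf_test percentile output."""
    pairs = re.findall(r"Percentile\s+([\d.]+)\s*=\s*(\d+)ns", output)
    return {float(p): int(ns) for p, ns in pairs}


def improvement_ns(before, after, percentile):
    """Latency saved at the given percentile when going from `before` to `after`."""
    return parse_percentiles(before)[percentile] - parse_percentiles(after)[percentile]


if __name__ == "__main__":
    for p in (50.0, 99.0):
        print(f"p{p:g}: {improvement_ns(FRAME_RUN, PRELOADED_RUN, p)} ns lower")
```

For these two runs the median drops by 67 ns and the 99th percentile by 209 ns.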
Benchmarking latency with exasock (UDP/TCP)
We use sockperf for testing because it is open source and well understood. Before testing UDP/TCP, please ensure that raw frames are working correctly (as described above).
Start by downloading the sockperf source from the GitHub repository, and then build the application by running:
$ ./autogen.sh
$ ./configure --prefix=
$ make
$ make install
Ensure that bypass-only and local loopback are disabled:
$ exanic-config exanic0:0 bypass-only off
$ exanic-config exanic0:0 local-loopback off
Set up a second machine with another SmartNIC and the configuration options from the steps above, and connect port 0 of each card together using a short fiber or direct attach cable.
Then, set up IP addresses on both hosts with:
$ ifconfig <interface> <ip-address> netmask <mask>
Run accelerated TCP/UDP sockperf on client and server:
server# exasock taskset -c 2 sockperf sr -i 10.10.0.2
client# exasock taskset -c 2 sockperf pp -i 10.10.0.2 -t5 -m 14
sockperf: == version #3.1-16.gitc6a0d0e3ab53 ==
sockperf: [Total Run] RunTime=5.450 sec; SentMessages=2882949;
sockperf: ====> avg-lat= 0.805 (std-dev=0.034)
sockperf: Summary: Latency is 0.805 usec
sockperf: ---> <MAX> observation = 3.062
sockperf: ---> percentile 99.999 = 1.174
sockperf: ---> percentile 99.900 = 1.002
sockperf: ---> percentile 99.000 = 0.931
sockperf: ---> percentile 90.000 = 0.838
sockperf: ---> percentile 75.000 = 0.817
sockperf: ---> percentile 50.000 = 0.802
sockperf: ---> percentile 25.000 = 0.784
sockperf: ---> <MIN> observation = 0.736
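sockperf reports latency in microseconds, while the raw frame benchmarks above use nanoseconds. The sketch below parses the summary lines shown above and converts them to nanoseconds so that the two sets of results can be compared directly. It assumes the output format of the sockperf version shown here, and the function name is illustrative.

```python
import re

SAMPLE_SOCKPERF = """\
sockperf: ====> avg-lat= 0.805 (std-dev=0.034)
sockperf: Summary: Latency is 0.805 usec
sockperf: ---> percentile 99.000 = 0.931
sockperf: ---> percentile 50.000 = 0.802
"""


def parse_sockperf(output):
    """Return average, standard deviation and percentiles from sockperf output, in ns."""
    avg = re.search(r"avg-lat=\s*([\d.]+)\s*\(std-dev=([\d.]+)\)", output)
    percentiles = re.findall(r"percentile\s+([\d.]+)\s*=\s*([\d.]+)", output)
    return {
        "avg_ns": round(float(avg.group(1)) * 1000) if avg else None,
        "std_dev_ns": round(float(avg.group(2)) * 1000) if avg else None,
        "percentiles_ns": {float(p): round(float(v) * 1000) for p, v in percentiles},
    }


if __name__ == "__main__":
    summary = parse_sockperf(SAMPLE_SOCKPERF)
    print(summary["avg_ns"], summary["percentiles_ns"])
```

For the run above this gives an average of 805 ns and a median of 802 ns.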
RESULT: ExaSock UDP 1/2 RTT latency < 810 ns.
This page was last updated on Dec-10-2020.