On an Intel Ivy Bridge test system with an ExaNIC X10, the median latency from application to network to application is 780 nanoseconds for small packets. This increases with frame size in a way that is approximately ideal, i.e. if you send a 164 byte packet you need to add the time taken to transmit 100 bytes at 10G line rate (80 ns) to this latency figure. Results will vary with architecture; see below for benchmarking setup, utilities and troubleshooting.
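The frame-size scaling described above is simple arithmetic: at 10 Gb/s each bit takes 0.1 ns on the wire, so every extra byte adds 0.8 ns. The following is a minimal sketch of that estimate. It assumes the 780 ns figure corresponds to a 64 byte frame (implied by the 164 byte example, but not stated outright) and that the scaling is exactly linear. The helper name expected_latency_ns is hypothetical and not part of the ExaNIC tools.

```shell
#!/bin/sh
# Rough estimate of loopback latency versus frame size at 10 Gb/s.
# Assumptions (not from the ExaNIC tools): the 780 ns baseline applies to
# a 64 byte frame, and latency grows exactly linearly with frame size.

BASE_NS=780        # median latency quoted for small packets
BASE_BYTES=64      # assumed size of a "small" packet

# expected_latency_ns FRAME_BYTES
# Prints the estimated median latency in whole nanoseconds.
expected_latency_ns() {
    # bytes * 8 bits, divided by 10 bits per ns, rounded down to a whole ns
    extra_ns=$(( ($1 - BASE_BYTES) * 8 / 10 ))
    echo $(( BASE_NS + extra_ns ))
}

expected_latency_ns 164   # 100 extra bytes add 80 ns, so this prints 860
```

Measured values should be compared against this estimate only loosely, since the baseline itself varies between systems.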
Following is a general guide to optimizing your system for performance benchmarks.
When performing initial benchmarking investigations, we advise setting up with as little equipment as possible. We suggest a simple looped-back cable for initial validation tests, without switches or other network equipment in series.
Running benchmarking utilities
A number of benchmarking utilities, both for the ExaNIC and for other cards, are located in the perf-test directory provided with the distribution. To build these benchmark utilities for ExaNIC:
$ cd perf-test
$ make exanic
The simplest benchmark to run is exanic_loopback, which sends packets out from one port and receives them on another.
$ ./exanic_loopback --help
exanic_loopback: sends a packet out from one port and waits for it on another, reporting timing statistics
  usage: exanic_loopback [-r] device tx_port rx_port data_size count
With the cable connected and the link up from port 0 to port 1, run for example (to obtain 1000 samples at 64 byte frame size):
$ ./exanic_loopback exanic0 0 1 64 1000
min=737ns median=770ns max=988ns first=2423ns cpu_ghz=3.492
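When collecting results from many runs it can help to pull a single statistic out of the summary line. The following is a minimal sketch that assumes the output format matches the sample run above and that the summary is written to standard output. The helper name loopback_field is hypothetical.

```shell
#!/bin/sh
# loopback_field NAME
# Reads exanic_loopback output on stdin and prints the numeric value of
# NAME (min, median, max or first) without the "ns" suffix.
# Assumes the "name=123ns" summary format shown in the sample run.
loopback_field() {
    # Put each "name=value" token on its own line, then match the one we want.
    tr ' ' '\n' | sed -n "s/^$1=\([0-9][0-9]*\)ns\$/\1/p"
}
```

For example, `./exanic_loopback exanic0 0 1 64 1000 | loopback_field median` would print only the median in nanoseconds.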
Turn off hyperthreading, SpeedStep, power saving, and any other energy saving settings that may be turned on. These can cause poor latency while the CPU ramps up. It is sometimes possible to identify whether this is happening by looking at cpuinfo. To do this, run:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
stepping : 9
microcode : 0x1c
cpu MHz : 1600.00
In the above example the CPU is running in a low power state at 1600 MHz, which is below its 3500 MHz nominal speed.
Another way is to check the output of running the loopback test back-to-back, e.g.:
# ./exanic_loopback exanic0 0 1 64 1000; ./exanic_loopback exanic0 0 1 64 100000
min=1319ns median=1425ns max=26220ns first=19390ns cpu_ghz=4.007
min=819ns median=891ns max=8187ns first=3888ns cpu_ghz=4.008
By the second run the CPU core has transitioned out of its power saving states, which is why its minimum and median latencies are lower.
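The cpuinfo check described above can be applied to every core at once. The following is a minimal sketch that lists cores currently running below a nominal frequency you supply. The helper name slow_cpus is hypothetical, and the nominal value (3500 here) must be taken from your own CPU's specification.

```shell
#!/bin/sh
# slow_cpus NOMINAL_MHZ
# Reads /proc/cpuinfo-style text on stdin and prints "processor MHz" for
# every core whose current "cpu MHz" value is below NOMINAL_MHZ.
slow_cpus() {
    awk -F: -v nominal="$1" '
        /^processor/ { cpu = $2 + 0 }
        /^cpu MHz/   { if ($2 + 0 < nominal + 0) print cpu, $2 + 0 }
    '
}
```

For example, `slow_cpus 3500 < /proc/cpuinfo` prints nothing when all cores are at or above 3500 MHz.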
Kernel Build Configuration
Ensure that your kernel is built with CONFIG_NO_HZ_FULL=y. This setting allows you to run the kernel in fully tickless mode on your performance cores. Timer ticks from the kernel will otherwise interrupt your process, causing latency spikes. To check whether your kernel supports full tickless behaviour, examine the kernel config file, e.g.:
$ cat /boot/config-4.10.11-100.fc24.x86_64 | grep NO_HZ_FULL
CONFIG_NO_HZ_FULL=y
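The same check can be written so that it does not hard-code the kernel version. The following is a minimal sketch. It assumes your distribution installs the config of the running kernel as /boot/config-$(uname -r), which is common but not universal. The helper name has_nohz_full is hypothetical.

```shell
#!/bin/sh
# has_nohz_full CONFIG_FILE
# Succeeds (exit status 0) if the given kernel config file enables
# full tickless mode, and fails otherwise.
has_nohz_full() {
    grep -q '^CONFIG_NO_HZ_FULL=y$' "$1"
}
```

For example, `has_nohz_full "/boot/config-$(uname -r)" && echo "full tickless supported"` reports on the currently running kernel.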
Kernel Boot Configuration
When testing software performance, ensure that the kernel boot parameters are configured for realtime performance. This is usually done by modifying /etc/default/grub. The following is an example:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 intel_idle.max_cstate=0 irqaffinity=0,1 selinux=0 audit=0 tsc=reliable"
- isolcpus=2,3 causes the scheduler to remove CPUs 2 and 3 from the scheduling pool
- nohz_full=2,3 causes CPUs 2 and 3 to run in fully tickless mode
- rcu_nocbs=2,3 stops RCU callbacks from running on these cores
- intel_idle.max_cstate=0 disables the intel_idle driver and falls back to acpi_idle
- irqaffinity=0,1 sets the default IRQ mask to cores 0 and 1
- selinux=0 disables the SELinux extensions
- audit=0 disables the kernel auditing system
- tsc=reliable marks the TSC clocksource as reliable, which disables clocksource verification at runtime
After regenerating the boot image and rebooting, you can check that these parameters have taken effect by running the command below. You should see the parameters listed above in its output:
$ cat /proc/cmdline
See Linux kernel parameters documentation for more information.
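Comparing /proc/cmdline against the expected parameters by eye is error-prone, so the check can be scripted. The following is a minimal sketch that prints any expected parameter missing from a command line string. The helper name missing_params is hypothetical.

```shell
#!/bin/sh
# missing_params CMDLINE PARAM...
# Prints each PARAM that does not appear as a whole word in CMDLINE.
# Prints nothing if every parameter is present.
missing_params() {
    cmdline=" $1 "          # pad with spaces so whole-word matching works
    shift
    for param in "$@"; do
        case "$cmdline" in
            *" $param "*) ;;            # present, nothing to report
            *) echo "$param" ;;         # absent, report it
        esac
    done
}
```

For example, `missing_params "$(cat /proc/cmdline)" isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3` prints nothing when all three parameters took effect.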
Make sure the ExaNIC is plugged into a PCIe x8 Gen 3 slot and is running at 8.0 GT/s per lane (for systems that support PCIe Gen 3). This can be identified by running the lspci command and looking at the LnkSta (link status) output.
$ sudo lspci -d 1ce4:* -vvv | grep LnkSta:
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
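The link status check can also be turned into a pass/fail test. The following is a minimal sketch that assumes the LnkSta format shown above. The helper name pcie_link_ok is hypothetical, and the pattern also accepts an "8.0GT/s" spelling as a precaution.

```shell
#!/bin/sh
# pcie_link_ok
# Reads "lspci -vvv" output on stdin. Succeeds only if at least one LnkSta
# line is present and every LnkSta line reports 8 GT/s and x8 width.
pcie_link_ok() {
    awk '
        /LnkSta:/ {
            seen = 1
            # Flag the link if either the speed or the width is not as expected.
            if ($0 !~ /Speed 8(\.0)?GT\/s/ || $0 !~ /Width x8/) bad = 1
        }
        END { exit !(seen && !bad) }
    '
}
```

For example, `sudo lspci -d 1ce4:* -vvv | pcie_link_ok || echo "ExaNIC link is not Gen 3 x8"`.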
Make sure the ExaNIC is plugged into a PCIe slot directly connected to a CPU. The server or motherboard documentation should indicate which slots are connected to CPUs and which are connected to the chipset. If unsure, the following procedure can be used. First determine the bus number of the ExaNIC from lspci:
$ sudo lspci -d 1ce4:*
02:00.0 Ethernet controller: Exablaze ExaNIC X10
In this case, the bus number is 02. Now search for the device that has secondary=02 in the output of lspci -v, for example:
$ sudo lspci -v
...
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09) (prog-if 00 [Normal decode])
    Flags: bus master, fast devsel, latency 0
    Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
...
For optimal performance, this should be a processor root port (in this case, “Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port”).
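The search for the upstream bridge described above can be automated. The following is a minimal sketch that assumes the lspci -v layout shown in the example, where each device starts on an unindented line and its details are indented below it. The helper name parent_bridge is hypothetical.

```shell
#!/bin/sh
# parent_bridge BUS
# Reads "lspci -v" output on stdin and prints the header line of the
# bridge whose secondary bus number equals BUS (for example 02).
parent_bridge() {
    awk -v bus="$1" '
        /^[0-9a-f]/ { header = $0 }      # device header lines start in column 0
        index($0, "secondary=" bus ",") { print header }
    '
}
```

For example, `sudo lspci -v | parent_bridge 02` prints the bridge line, which can then be inspected to see whether it is a processor root port.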
The simplest starting point is to install a loopback cable between the first two ports of an ExaNIC device (i.e. exanic0:0 to exanic0:1).
With the cable present, exanic-config should report SFP present and signal detected for the two ports. If this is not the case, confirm that the SFPs are completely inserted and, if possible, replace the cable with a known good one.
The next step is to validate that the speeds of the two ports match. Run exanic-config again and confirm that the two Port speed: values match. For best benchmarking results this should be the highest available speed, e.g. 10000 Mbps. If the speeds differ or are slow, you can change them by running:
exanic-config exanic0:0 speed 10000
exanic-config exanic0:1 speed 10000
Next confirm that the Port status: values are enabled. If they are not, execute the exanic-config exanic0:0 up command.
The final exanic-config output for both ports should look similar to:
Port 0:
  Interface: enp1s0
  Port speed: 10000 Mbps
  Port status: enabled, SFP present, signal detected, link active
Port 1:
  Interface: enp1s0d1
  Port speed: 10000 Mbps
  Port status: enabled, SFP present, signal detected, link active
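The port checks above can be combined into one scripted test. The following is a minimal sketch that assumes exanic-config prints each port as a "Port N:" line followed by a "Port status:" line, as in the sample output. The helper name ports_not_ready is hypothetical.

```shell
#!/bin/sh
# ports_not_ready
# Reads exanic-config output on stdin and prints the "Port N:" label of
# every port whose status line differs from the fully ready state
# "enabled, SFP present, signal detected, link active".
ports_not_ready() {
    awk '
        /^[ \t]*Port [0-9]+:/ { port = $1 " " $2 }
        /Port status:/ && !/enabled, SFP present, signal detected, link active/ {
            print port
        }
    '
}
```

For example, `exanic-config exanic0 | ports_not_ready` prints nothing when both ports are ready for the loopback test.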
For simple loopback tests, also ensure that bypass-only mode is used:
exanic-config exanic0:0 bypass-only on
There are a number of configuration options for Linux that will improve realtime performance.
We recommend the following:
cpus="2 3"
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo 0 > /proc/sys/kernel/watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 3 > /proc/irq/default_smp_affinity
for irq in `ls /proc/irq/`; do echo 1 > /proc/irq/$irq/smp_affinity; done
for irq in `ls /proc/irq/`; do echo -n "$irq "; cat /proc/irq/$irq/smp_affinity_list; done
for cpu in $cpus
do
    echo "performance" > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
    echo 0 > /sys/devices/system/machinecheck/machinecheck$cpu/check_interval
done
The above code:
- disables Linux realtime throttling, which otherwise ensures that realtime processes cannot starve the CPUs.
- disables the Linux watchdog timer, which is used to detect and recover from software faults.
- disables the NMI watchdog, a debugging feature for catching hardware hangs.
- sets the default IRQ affinity mask to 0b11 (3), which means that only CPUs 0 and 1 handle interrupts.
- moves all interrupts off CPUs 2 and 3.
- sets the CPU frequency scaling governor to performance on CPUs 2 and 3.
- disables periodic machine check polling on CPUs 2 and 3.
For more information, see Improving Linux Realtime Properties.
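After applying these settings it is worth confirming that no interrupt is still steered to an isolated core. The following is a minimal sketch that tests a hexadecimal affinity mask for a given CPU bit and walks /proc/irq with it. The helper names mask_has_cpu and irqs_on_cpus are hypothetical, and the sketch assumes a system with fewer than 64 CPUs so that the mask fits in shell arithmetic.

```shell
#!/bin/sh
# mask_has_cpu HEX_MASK CPU
# Succeeds if bit number CPU is set in the hexadecimal affinity mask.
# Commas, which the kernel inserts between 32-bit groups, are ignored.
mask_has_cpu() {
    mask=$(printf '%s' "$1" | tr -d ',')
    [ $(( (0x$mask >> $2) & 1 )) -eq 1 ]
}

# irqs_on_cpus CPU...
# Lists every IRQ under /proc/irq whose affinity mask still includes one
# of the given CPUs.
irqs_on_cpus() {
    for f in /proc/irq/*/smp_affinity; do
        [ -r "$f" ] || continue
        for cpu in "$@"; do
            if mask_has_cpu "$(cat "$f")" "$cpu"; then
                echo "$f includes CPU $cpu"
            fi
        done
    done
}
```

For example, running `irqs_on_cpus 2 3` as root lists any interrupts that remain on the isolated cores.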
Benchmarking latency with libexanic (raw frames)
When running a benchmark, pin the process to one of the isolated cores as follows:
$ sudo taskset -c 2 ./exanic_loopback exanic0 0 0 64 1000
These steps should ensure best results.
Benchmarking latency with exasock (UDP/TCP)
We use sockperf for testing because it is open-source and well understood. Before testing UDP/TCP, please ensure raw frames are working correctly (as described above).
Start by downloading the sockperf source from the GitHub repository, and then build the application by running:
$ ./autogen.sh
$ ./configure --prefix=
$ make
$ make install
Turn off bypass and local loopback:
$ exanic-config exanic0:0 bypass-only off
$ exanic-config exanic0:0 local-loopback off
Set up a second machine with the configuration options from the steps above, and connect port 0 of its ExaNIC to port 0 of the first using a short fibre or direct attach cable.
Then, set up IP addresses on both hosts with:
$ ifconfig <interface> <ip-address> netmask <mask>
Run accelerated TCP/UDP sockperf on client and server:
server# exasock taskset -c 2 sockperf sr -i 10.10.0.2
client# exasock taskset -c 2 sockperf pp -i 10.10.0.2 -t5 -m 14
sockperf: == version #3.1-16.gitc6a0d0e3ab53 ==
sockperf: [Total Run] RunTime=5.450 sec; SentMessages=2882949;
sockperf: ====> avg-lat= 0.932 (std-dev=0.034)
sockperf: Summary: Latency is 0.932 usec
sockperf: ---> <MAX> observation = 3.266
sockperf: ---> percentile 99.999 = 1.309
sockperf: ---> percentile 99.900 = 1.150
sockperf: ---> percentile 99.000 = 1.060
sockperf: ---> percentile 90.000 = 0.969
sockperf: ---> percentile 75.000 = 0.945
sockperf: ---> percentile 50.000 = 0.925
sockperf: ---> percentile 25.000 = 0.912
sockperf: ---> <MIN> observation = 0.862
RESULT: ExaSock UDP 1/2RTT latency <880ns.
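To compare runs, a single percentile can be pulled out of the sockperf report. The following is a minimal sketch that assumes the "percentile" line format shown in the sample output above. The helper name sockperf_percentile is hypothetical.

```shell
#!/bin/sh
# sockperf_percentile PCT
# Reads sockperf output on stdin and prints the latency in microseconds
# reported for percentile PCT (for example 50.000 for the median).
sockperf_percentile() {
    awk -v pct="$1" '
        $2 == "--->" && $3 == "percentile" && $4 == pct { print $6 }
    '
}
```

For example, piping the client output through `sockperf_percentile 99.000` prints only the 99th percentile latency.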
This page was last updated on Oct-12-2019.