
On an Intel Ivy Bridge test system with an ExaNIC X10, the median latency from application to network to application is 780 nanoseconds for small packets. This increases with frame size in a way that is approximately ideal: for example, sending a 164 byte packet instead of a 64 byte packet adds the time to serialize 100 extra bytes at 10G line rate to this latency figure. Results will vary with architecture; see below for benchmarking setup, utilities and troubleshooting.
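As a quick sketch of that scaling, the extra serialization time at 10G line rate (10 bits per nanosecond on the wire) can be estimated with simple arithmetic:

```shell
# Estimate the extra latency from a larger frame at 10G line rate.
# 10 Gbit/s = 10 bits per nanosecond, so each extra byte adds 0.8 ns.
extra_bytes=100
echo "$((extra_bytes * 8 / 10)) ns extra"
```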

Following is a general guide to optimizing your system for performance benchmarks.

When performing initial benchmarking investigations, it is advisable to set up with as little equipment as possible. We suggest a simple looped-back cable for initial validation tests, without other switches or network equipment in series.

Running benchmarking utilities

A number of benchmarking utilities, both for the ExaNIC and for other cards, are located in the perf-test directory provided with the distribution. To build these benchmark utilities for ExaNIC:

$ cd perf-test
$ make exanic

The simplest benchmark to run is exanic_loopback, which sends packets out from one port and receives them on another.

$ ./exanic_loopback --help
  sends a packet out from one port and waits for it on another, reporting timing statistics
  usage: exanic_loopback [-r] device tx_port rx_port data_size count

With the cable connected and link up from port 0 to port 1 run for example (to obtain 1000 samples at 64 byte frame size):

$ ./exanic_loopback exanic0 0 1 64 1000
min=737ns median=770ns max=988ns first=2423ns cpu_ghz=3.492

BIOS Configuration

Turn off Hyper-Threading, SpeedStep, and any other power or energy saving settings that may be enabled. These can cause poor latency while the CPU ramps up to full speed. It is sometimes possible to identify whether this is happening by looking at /proc/cpuinfo. To do this run:

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
stepping    : 9
microcode   : 0x1c
cpu MHz     : 1600.00

In the above example the CPU is running in a low power state at 1600 MHz, which is below its 3500 MHz nominal speed.
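To watch for frequency transitions while a benchmark runs, the per-core clock can be polled continuously (a sketch; `watch` is available on most distributions):

```shell
# Show the current reported clock of every core, refreshing every second.
watch -n1 'grep "cpu MHz" /proc/cpuinfo'
```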

Another way is to check the output of running the loopback test twice back-to-back, e.g.:

# ./exanic_loopback exanic0 0 1 64 1000; ./exanic_loopback exanic0 0 1 64 100000                         
min=1319ns median=1425ns max=26220ns first=19390ns cpu_ghz=4.007
min=819ns median=891ns max=8187ns first=3888ns cpu_ghz=4.008

In the second instance the CPU core has transitioned out of power saving states.

Kernel Build Configuration

Ensure that your kernel is built with CONFIG_NO_HZ_FULL=y. This setting allows the kernel to run in fully tickless mode on your performance cores; otherwise, timer ticks from the kernel will interrupt your process and cause latency spikes. To check whether your kernel supports full tickless operation, examine the kernel config file, e.g.:

cat /boot/config-4.10.11-100.fc24.x86_64 | grep NO_HZ_FULL
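On kernels built with NO_HZ_FULL, the set of cores actually running tickless can also be checked at runtime via sysfs (a sketch; on recent kernels this file exists and is empty if no nohz_full cores are configured):

```shell
# CPUs currently designated adaptive-tick (nohz_full), if any
cat /sys/devices/system/cpu/nohz_full
```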

Kernel Boot Configuration

When testing software performance, ensure that the kernel boot parameters are configured for realtime performance. This is usually done by modifying /etc/default/grub. For example:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 intel_idle.max_cstate=0 irqaffinity=0,1 selinux=0 audit=0 tsc=reliable"
  • isolcpus=2,3 causes the scheduler to remove CPUs 2 and 3 from the scheduling pool
  • nohz_full=2,3 causes CPUs 2 and 3 to run in fully tickless mode
  • rcu_nocbs=2,3 stops RCU callbacks to these cores
  • intel_idle.max_cstate=0 disables the intel_idle driver, falling back to acpi_idle
  • irqaffinity=0,1 sets the default IRQ mask to cores 0 and 1
  • selinux=0 disables the SELinux extensions
  • audit=0 disables the kernel auditing system
  • tsc=reliable marks the TSC clocksource as reliable, which disables clocksource verification at runtime

After regenerating the boot image and rebooting, you can check that this change has taken effect by running the command below; you should see the parameters above in the output:

$ cat /proc/cmdline

See Linux kernel parameters documentation for more information.
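As a quick sanity check (a sketch, assuming the parameter list above), each expected parameter can be grepped out of /proc/cmdline:

```shell
# Verify that each realtime-related boot parameter made it into the
# running kernel's command line.
for param in isolcpus nohz_full rcu_nocbs intel_idle.max_cstate irqaffinity tsc; do
    if grep -q "$param=" /proc/cmdline; then
        echo "$param: present"
    else
        echo "$param: MISSING"
    fi
done
```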

Hardware Configuration

Make sure the ExaNIC is plugged into a PCIe x8 Gen 3 slot and is running at 8.0 GT/s per lane (for systems that support PCIe Gen 3). This can be verified by running the lspci command and looking at the LnkSta (link status) output:

$ sudo lspci -d 1ce4:* -vvv |grep LnkSta:
LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

Make sure the ExaNIC is plugged into a PCIe slot directly connected to a CPU. The server or motherboard documentation should indicate which slots are connected to CPUs and which are connected to the chipset. If unsure, the following procedure can be used. First determine the bus number of the ExaNIC from lspci:

$ sudo lspci -d 1ce4:*
02:00.0 Ethernet controller: Exablaze ExaNIC X10

In this case, the bus number is 02. Now search for the device that has secondary=02 in the output of lspci -v, for example:

$ sudo lspci -v
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core
processor PCI Express Root Port (rev 09) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=0

For optimal performance, this should be a processor root port (in this case, “Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port”).
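Alternatively, the PCIe topology can be read directly from sysfs (a sketch; the device address 0000:02:00.0 is taken from the example above). Each path component between the root bus and the device is one bridge hop; a card attached directly to a CPU root port shows only a single hop:

```shell
# Resolve the ExaNIC's position in the PCIe tree. A path like
# .../pci0000:00/0000:00:01.1/0000:02:00.0 indicates one hop (a root port).
readlink -f /sys/bus/pci/devices/0000:02:00.0
```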

ExaNIC Configuration

The simplest starting point is to install a loopback cable between the first two ports of an ExaNIC device (i.e. exanic0:0 to exanic0:1). With the cable present, exanic-config should report SFP present and signal detected for the two ports. If this is not the case, confirm the SFPs are fully inserted and, if possible, replace the cable with a known good one.

The next step is to validate that the speeds of the two ports match: again run exanic-config and confirm that the two Port speed: values are the same. For best benchmarking results this should be the highest available speed, e.g. 10000 Mbps. If the speeds differ or are lower than expected, you can change them by running:

$ exanic-config exanic0:0 speed 10000
$ exanic-config exanic0:1 speed 10000

Next confirm that the Port status: values show enabled. If they do not, run exanic-config exanic0:0 up (and likewise for port 1). The final exanic-config output for both ports should look similar to:

Port 0:
 Interface: enp1s0
 Port speed: 10000 Mbps
 Port status: enabled, SFP present, signal detected, link active


Port 1:
 Interface: enp1s0d1
 Port speed: 10000 Mbps
 Port status: enabled, SFP present, signal detected, link active

For simple loopback tests, also ensure bypass-only mode is enabled: exanic-config exanic0:0 bypass-only on
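The port setup steps above can be consolidated into a short script (a sketch; it assumes a loopback cable between exanic0:0 and exanic0:1):

```shell
# Bring up both loopback ports at 10G with bypass-only mode enabled.
for port in exanic0:0 exanic0:1; do
    exanic-config "$port" speed 10000
    exanic-config "$port" up
    exanic-config "$port" bypass-only on
done
exanic-config   # show the final status of all ports
```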

Software Settings

There are a number of configuration options for Linux that will improve realtime performance.

We recommend the following:

cpus="2 3"

echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo 0 > /proc/sys/kernel/watchdog
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 3 > /proc/irq/default_smp_affinity

# Move all movable IRQs onto CPU 0 (some IRQs cannot be moved, so errors
# are suppressed), then print the resulting affinity of each IRQ.
for irq in /proc/irq/[0-9]*; do echo 1 > $irq/smp_affinity 2>/dev/null; done
for irq in /proc/irq/[0-9]*; do echo -n "${irq##*/}  "; cat $irq/smp_affinity_list; done

for cpu in $cpus; do
    echo "performance" > /sys/devices/system/cpu/cpu$cpu/cpufreq/scaling_governor
    echo 0 > /sys/devices/system/machinecheck/machinecheck$cpu/check_interval
done

The above code:

  • disables Linux realtime throttling, so that realtime processes are not limited in the CPU time they may consume
  • disables the Linux watchdog timer, which is used to detect and recover from software faults
  • disables the NMI watchdog, a debugging feature for catching hardware lockups
  • sets a default IRQ affinity mask of 0b11 (3), which means that only CPUs 0 and 1 handle interrupts
  • moves all interrupts off CPUs 2 and 3
  • sets the performance frequency governor on the benchmark CPUs and disables periodic machine check polling on them

For more information, see Improving Linux Realtime Properties.

Benchmarking latency with libexanic (raw frames)

When running a benchmark, pin the process to one of the isolated cores as follows:

$ sudo taskset -c 2 ./exanic_loopback exanic0 0 1 64 1000

Following these steps should ensure the best results.
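To confirm the isolated core really is quiet, it can be useful to list which tasks last ran on it (a sketch; with isolcpus=2,3 in effect, CPU 2 should show only the pinned benchmark and a handful of per-CPU kernel threads):

```shell
# List processes whose last-used CPU (the psr column) is core 2.
ps -eo pid,psr,comm | awk '$2 == 2 {print}'
```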

Benchmarking latency with exasock (UDP/TCP)

We use sockperf for testing because it is open source and well understood. Before testing UDP/TCP, ensure that raw frames are working correctly (as described above).

Start by downloading the sockperf source from the GitHub repository, then build the application by running:

$ ./autogen.sh 
$ ./configure --prefix= 
$ make 
$ make install  

Turn off bypass and local loopback:

$ exanic-config exanic0:0 bypass-only off
$ exanic-config exanic0:0 local-loopback off

Set up a second machine with the configuration options from the steps above, and connect port 0 of each ExaNIC to the other using a short fibre or direct attach cable.

Then, set up IP addresses on both hosts with:

$ ifconfig <interface> <ip-address> netmask <mask>
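For example, with hypothetical addresses on a private subnet (the interface name and addresses below are examples only; adjust for your setup):

```shell
# On the server:
ifconfig enp1s0 192.168.100.1 netmask 255.255.255.0
# On the client:
ifconfig enp1s0 192.168.100.2 netmask 255.255.255.0
```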

Run accelerated TCP/UDP sockperf on client and server:

server# exasock taskset -c 2 sockperf sr -i
client# exasock taskset -c 2  sockperf pp -i -t5 -m 14
sockperf: == version #3.1-16.gitc6a0d0e3ab53 == 
sockperf: [Total Run] RunTime=5.450 sec; SentMessages=2882949;
sockperf: ====> avg-lat=  0.932 (std-dev=0.034)
sockperf: Summary: Latency is 0.932 usec
sockperf: ---> <MAX> observation =    3.266
sockperf: ---> percentile 99.999 =    1.309
sockperf: ---> percentile 99.900 =    1.150
sockperf: ---> percentile 99.000 =    1.060
sockperf: ---> percentile 90.000 =    0.969
sockperf: ---> percentile 75.000 =    0.945
sockperf: ---> percentile 50.000 =    0.925
sockperf: ---> percentile 25.000 =    0.912
sockperf: ---> <MIN> observation =    0.862

RESULT: ExaSock UDP 1/2RTT latency <880ns.

This page was last updated on Oct-12-2019.