The Cisco Nexus SmartNIC (formerly ExaNIC) Sockets acceleration library allows applications to benefit from the low latency of direct access to the SmartNIC without requiring modifications to the application. This is achieved by intercepting calls to the Linux socket APIs.

While SmartNIC Sockets should be compatible with most applications using Linux socket APIs, there are some cases where programs may not work as expected. Feedback and bug reports would be greatly appreciated (create a case with Cisco TAC).

Software installation

Build the SmartNIC driver and libraries as per the SmartNIC Installation and Configuration Guide. SmartNIC Sockets is built and installed as a standard component, and the exasock kernel module is loaded automatically when an SmartNIC interface is brought up.

Usage

First ensure that the application works without SmartNIC Sockets. All IP addresses should be configured as if you were running the application through the normal Linux network interface corresponding to the SmartNIC.

Then, to accelerate the application, simply prefix it with the exasock command. For example, to run the UNIX netcat (nc) utility to listen for UDP datagrams on port 1234:

$ exasock nc -u -l 1234

Another simple example application that receives and sends UDP multicast datagrams is located in the SmartNIC source code distribution (examples/exasock/multicast-echo.c). Note that this is a normal Linux sockets application that can be run either with or without the SmartNIC Sockets acceleration library.

Sometimes it can be difficult to determine if the kernel bypass is functioning correctly. Setting the EXASOCK_DEBUG environment variable prints extra debugging information that can help. For example:

$ EXASOCK_DEBUG=1 exasock nc -u -l 1234
exasock: enabled bypass on fd 4

In this case, the message exasock: enabled bypass on fd 4 indicates that kernel bypass has been enabled for the socket associated with file descriptor 4.

The exasock command itself can take several arguments:

--help: display a message summarising these arguments
--debug: use the debug version of libexasock
--trace: record system calls, their arguments and their return values to stdout
--no-warn: turn off warning messages
--no-auto: disable acceleration on new sockets by default. It is still possible to opt-in to acceleration by calling setsockopt() with SO_EXA_NO_ACCEL and supplying zero as the argument.

Displaying Exanic sockets accelerated connections

To provide insight into the current SmartNIC accelerated socket connections the utility exasock-stat is provided. By default running exasock-stat will display all accelerated UDP and TCP, listening and connected sockets.

The exasock-stat application was introduced with the v2.0.0 SmartNIC driver and software package and can be found within the util directory. This application will build as part of the utils build however the libnl3-devel package is not present by default in which case building exasock-stat will be skipped. To ensure it can be built first run the appropriate command to install the missing libnl-3-dev package e.g. sudo apt-get install libnl-3-dev or sudo yum install libnl3-devel.x86_64 then re-run make && make install.

Then when running:

exasock-stat

you should see a table similar to the following:

Active SmartNIC Sockets accelerated connections (servers and established):
  Proto | Recv-Q   | Send-Q   | Local Address            | Foreign Address    | State
  UDP   | 0        | 0        | 192.168.10.10:12345      | *:*                | -

The columns shown are:

Proto:
   The protocol used by the socket (TCP or UDP)
Recv-Q:
   Connected: The count of bytes not copied by the user program
              connected to this socket
   Listening: The count of connections waiting to be accepted by the
              user program
Send-Q:
   Connected: The count of bytes not acknowledged by the remote host
   Listening: N/A
Local Address:
   Address and port number of the local end of the socket
Foreign Address:
   Address and port number of the remote end of the socket
State:
   The state of the socket


Extended Output not shown (-e/--extend enabled):
User:
   The username or the user id (UID) of the owner of the socket
PID:FD:
   PID of the process that owns the socket and value of the socket's
   file descriptor
Program:
   Process name of the process that owns the socket

Exactly what the application displays can be controlled by providing arguments from the command line. To see the arguments available run exasock-stat --help

Disabling acceleration per-socket

If only the SmartNIC Sockets acceleration library is used, then each socket bound to either an SmartNIC interface or to a wildcard address (INADDR_ANY) gets automatically accelerated (i.e. the kernel is bypassed to allow direct access to the SmartNIC).

As of exasock version 2.0.0 it is possible to disable default acceleration on a given socket, even if bound to an SmartNIC interface (or bound to a wildcard address or joined a multicast group with an SmartNIC interface).

In order to use this feature the application is required to include the <exasock/socket.h> header file and to disable the acceleration as needs be for each socket. This is done by either setting the exasock private SO_EXA_NO_ACCEL socket option, or alternatively by calling the exasock_disable_acceleration() helper function.

Disabling acceleration on a socket is not allowed if the socket has already been accelerated (either by binding it to an SmartNIC interface or joining a multicast group with an SmartNIC interface).

Once acceleration has been disabled on a socket, it can no longer be re-enabled.

Documentation for both the exasock private socket option and the helper function can be found in the header file <exasock/socket.h>.

Multicast sockets

Versions of exasock older than 2.0.0 automatically accelerate each socket bound to a multicast address. Newer versions (2.0.0 and beyond) accelerate a multicast socket only if joined a multicast group (via IP_ADD_MEMBERSHIP socket option) with an SmartNIC interface. For any accelerated socket exasock version 2.0.0 (or later) receives multicast packets only from the interface with which the socket has joined the multicast group.

Warning

exasock 2.0.0 and later: If a socket bound to a wildcard address (INADDR_ANY) is to be used for receiving multicast traffic, it is worth to keep in mind that it will always be accelerated. Multicast packets are going to be discarded on this socket unless it has been set with IP_ADD_MEMBERSHIP option to join given multicast group and multicast packets are arriving through the SmartNIC interface specified in the IP_ADD_MEMBERSHIP configuration.

Warning

exasock 2.0.0 and later: If a socket bound to a multicast address but not associated with an SmartNIC interface through IP_ADD_MEMBERSHIP option is to be used, then it will not get accelerated. It will receive multicast packets through the native kernel networking stack instead.

TCP acceleration

exasock 1.7.0 and later include a library called exasock_ext that allows user applications to detect that they are running under exasock and access functionality beyond the standard Linux socket calls. In particular, the current version provides a TCP acceleration feature that allows programmers to achieve even lower TCP latencies than possible with the normal send/sendto/sendmsg APIs.

Using this TCP acceleration feature, an application can construct partial or complete TCP packets ahead of time. These pre-built packets can then be transmitted through the lower level libexanic library, or can even be pushed to a user FPGA application on the SmartNIC card for ultra-low latency responses to triggers. To learn more about this method, see the section on TX preloading.

Since some of the setup for TX preloading needs to know the port number, the following function is provided for convenience when trying to find out which SmartNIC name and port number correspond to a given file descriptor:

int exasock_tcp_get_device(int fd, char *dev, size_t dev_len, int *port_num);

fd is the file descriptor you want to know more about. dev is a buffer in which to put the SmartNIC name, and dev_len is the amount of space available in that buffer. port_num is also a return parameter, and will contain the port number associated with the file descriptor. This function returns 0 on success and -1 on error, in which case errno will be set appropriately.

The following function constructs a header for the frame you want to send, and inserts it into the buffer provided:

ssize_t exasock_tcp_build_header(int fd, void *buf, size_t len, size_t offset,
                                 int flags);

Note that this builds not only the TCP header, but the IP and Ethernet headers as well. fd is the file descriptor for the connection (e.g. that returned by accept). buf is the buffer to be used for the header data, and len is the length of that buffer. offset and flags are currently unused, and should be set to 0.

The following functions sets the length in the IP field, and calculates the IP checksum. Note that this function assumes the IP and TCP headers have no added options, which will be the case for headers generated by exasock_tcp_build_header.

int exasock_tcp_set_length(void *hdr, size_t hdr_len, size_t data_len);

hdr is a pointer to some header bytes, and hdr_len is the number of bytes available after that pointer. data_len is the length of the payload. The return value for this function is normally zero, and -1 if an error has occurred.

The following function calculates the TCP checksum for the header and data pointed to, and places it in the header:

int exasock_tcp_calc_checksum(void *hdr, size_t hdr_len,
                              const void *data, size_t data_len);

hdr is a pointer to the header bytes of the frame, and hdr_len is the number of bytes valid after this address. data is the TCP payload, and data_len is the length of the payload. The function returns 0 on success, and -1 if an error has occurred.

Warning

Don't modify the frame between exasock_tcp_calc_checksum and exasock_tcp_send_advance.

Warning

If you are using the exasock TCP extension then you need to ensure that if you have prepared a TCP frame for transmission, that it is the next frame transmitted. If, for example, another TCP frame is sent via exasock using send(), then the prepared TCP frame will have the incorrect sequence number etc., and so must be discarded.

The following function is intended to be called after your frame has been sent. It will update Exasock's TCP state to account for the packet that was just transmitted by copying the just-sent data to the retransmission buffer, and updating its sequence numbers.

int exasock_tcp_send_advance(int fd, const void *data, size_t data_len);

fd is the file descriptor for the TCP connection, data is the payload that was transmitted, and data_len is the length of the payload.

There is an example tying these concepts together available in src/examples/exasock/tcp-raw-send.c.

Frame warming

Calling send, sendto or sendmsg under Exasock with the MSG_EXA_WARM flag present in the flags argument will result in the message being aborted as close possible to the end of the TX code path. This is intended to warm the cache.

Note that this flag was introduced in Exasock 2.2.0. Using this flag in prior versions will cause the frame to be sent as normal. As such, you should only use this flag in versions of Exasock after 2.2.0. You can check the Exasock version in your program:

if (exasock_version_code() < EXASOCK_VERSION(2,2,0)) {
    // do something else
}

Known issues and limitations

Each thread that calls a blocking I/O call - e.g. select(), poll(), epoll_wait(), recv(), read() or accept() - will spin waiting on data. This normally provides optimal latency but can induce performance problems if there are more threads than available CPUs. Other blocking modes will be provided in the future.
If a socket is bound to a wildcard address (INADDR_ANY), it will only receive packets that arrive on SmartNIC interfaces when run with the acceleration library.
If exasock version older than 2.0.0 is used and a socket is bound to a multicast address, or exasock version 2.0.0 or newer is used and a socket has joined a multicast group (IP_ADD_MEMBERSHIP socket option) with an SmartNIC interface, it will only receive packets that arrive on SmartNIC interfaces when run with the acceleration library.
Connecting to an accelerated socket from the same host is not supported (for example, if a socket is bound to 192.168.1.1:80, then it is not possible to connect to 192.168.1.1:80 from the local host).
Transmitted multicast datagrams are not looped back to local sockets.
The MSG_WAITALL flag to recv() is not currently supported (to be resolved).
No support for recursive addition of epoll file descriptors to epoll sets.
No support for IP fragmentation.
Sockets may not be correctly maintained across fork() or execve().
Sockets cannot be transferred to other processes with sendmsg().
recvmmsg() duplicates the Linux behavior of only checking the timeout after the receipt of each datagram, so that if up to vlen-1 datagrams received before the timeout expires, but then no further datagrams are received, the call will block forever.

Tips for best performance

Wherever possible, do not mix accelerated sockets with non-accelerated sockets and other file descriptors in select() and poll() calls.
For the best possible performance, pin threads to CPU cores in the CPU socket directly connected to the SmartNIC.

This page was last updated on Apr-06-2021.