The Cisco Nexus SmartNIC (formerly ExaNIC) Sockets acceleration library allows applications to benefit from the low latency of direct access to the SmartNIC without requiring modifications to the application. This is achieved by intercepting calls to the Linux socket APIs.
While SmartNIC Sockets should be compatible with most applications using Linux socket APIs, there are some cases where programs may not work as expected. Feedback and bug reports would be greatly appreciated (create a case with Cisco TAC).
Software installation
Build the SmartNIC driver and libraries as per the SmartNIC Installation and Configuration Guide. SmartNIC Sockets is built and installed as a standard component, and the exasock kernel module is loaded automatically when a SmartNIC interface is brought up.
Usage
First ensure that the application works without SmartNIC Sockets. All IP addresses should be configured as if you were running the application through the normal Linux network interface corresponding to the SmartNIC.
Then, to accelerate the application, simply prefix it with the exasock command. For example, to run the UNIX netcat (nc) utility to listen for UDP datagrams on port 1234:
$ exasock nc -u -l 1234
Another simple example application that receives and sends UDP multicast datagrams is located in the SmartNIC source code distribution (examples/exasock/multicast-echo.c). Note that this is a normal Linux sockets application that can be run either with or without the SmartNIC Sockets acceleration library.
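To make "a normal Linux sockets application" concrete, here is a minimal sketch of a UDP receiver that uses only standard POSIX socket calls, roughly what nc -u -l 1234 does. It is not the multicast-echo.c example from the distribution, and the function names open_udp_listener and recv_one are illustrative. Nothing in it refers to exasock, so the same binary could be started directly or prefixed with the exasock command.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a UDP socket bound to INADDR_ANY on the given port (host byte order).
 * Port 0 lets the kernel pick a free port. Returns the fd, or -1 on error. */
int open_udp_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Block until one datagram arrives and copy it into buf.
 * Returns the number of bytes received, or -1 on error. */
ssize_t recv_one(int fd, void *buf, size_t len)
{
    return recvfrom(fd, buf, len, 0, NULL, NULL);
}
```

A main() that calls open_udp_listener(1234) and then loops on recv_one() could be run either as ./udp-listen or as exasock ./udp-listen, where udp-listen is a hypothetical binary name.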
Sometimes it can be difficult to determine if the kernel bypass is functioning correctly. Setting the EXASOCK_DEBUG environment variable prints extra debugging information that can help. For example:
$ EXASOCK_DEBUG=1 exasock nc -u -l 1234
exasock: enabled bypass on fd 4
In this case, the message exasock: enabled bypass on fd 4 indicates that kernel bypass has been enabled for the socket associated with file descriptor 4.
The exasock command itself can take several arguments:
--help: display a message summarising these arguments
--debug: use the debug version of libexasock
--trace: record system calls, their arguments and their return values to stdout
--no-warn: turn off warning messages
--no-auto: disable acceleration on new sockets by default. It is still possible to opt-in to acceleration by calling setsockopt() with SO_EXA_NO_ACCEL and supplying zero as the argument.
Displaying SmartNIC Sockets accelerated connections
To provide insight into the current SmartNIC accelerated socket connections, the exasock-stat utility is provided. By default, running exasock-stat will display all accelerated UDP and TCP sockets, both listening and connected.
The exasock-stat application was introduced with the v2.0.0 SmartNIC driver and software package and can be found within the util directory. It is built as part of the utils build. However, it depends on the libnl3 development package (libnl-3-dev on Debian-based systems, libnl3-devel on Red Hat-based systems), which is not present by default; if the package is missing, building exasock-stat is skipped. To ensure it can be built, first install the missing package, e.g. sudo apt-get install libnl-3-dev or sudo yum install libnl3-devel.x86_64, then re-run make && make install.
Then when running:
exasock-stat
you should see a table similar to the following:
Active SmartNIC Sockets accelerated connections (servers and established):
Proto | Recv-Q | Send-Q | Local Address | Foreign Address | State
UDP | 0 | 0 | 192.168.10.10:12345 | *:* | -
The columns shown are:
Proto:
The protocol used by the socket (TCP or UDP)
Recv-Q:
Connected: The count of bytes not copied by the user program connected to this socket
Listening: The count of connections waiting to be accepted by the user program
Send-Q:
Connected: The count of bytes not acknowledged by the remote host
Listening: N/A
Local Address:
Address and port number of the local end of the socket
Foreign Address:
Address and port number of the remote end of the socket
State:
The state of the socket
Extended output (not shown above; displayed when -e/--extend is enabled):
User:
The username or the user id (UID) of the owner of the socket
PID:FD:
PID of the process that owns the socket and value of the socket's file descriptor
Program:
Process name of the process that owns the socket
Exactly what the application displays can be controlled by providing
arguments from the command line. To see the arguments available run
exasock-stat --help
Disabling acceleration per-socket
When an application is simply run with the SmartNIC Sockets acceleration library, each socket bound to either a SmartNIC interface or to a wildcard address (INADDR_ANY) is automatically accelerated (i.e. the kernel is bypassed to allow direct access to the SmartNIC).
As of exasock version 2.0.0 it is possible to disable default acceleration on a given socket, even if it is bound to a SmartNIC interface (or is bound to a wildcard address, or has joined a multicast group with a SmartNIC interface).
To use this feature, the application must include the <exasock/socket.h> header file and disable acceleration as needed for each socket. This is done either by setting the exasock private SO_EXA_NO_ACCEL socket option, or alternatively by calling the exasock_disable_acceleration() helper function.
Disabling acceleration on a socket is not allowed if the socket has already been accelerated (either by binding it to a SmartNIC interface or joining a multicast group with a SmartNIC interface).
Once acceleration has been disabled on a socket, it can no longer be re-enabled.
Documentation for both the exasock private socket option and the helper function can be found in the header file <exasock/socket.h>.
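As a rough sketch of the socket-option route, the wrapper below sets SO_EXA_NO_ACCEL on a socket. Several details are assumptions rather than facts taken from this page: the option level name SOL_EXASOCK, the int-sized option value, and the wrapper name set_no_accel; consult <exasock/socket.h> for the authoritative definitions. The __has_include guard is only a convenience so that the same source still compiles on a machine without the exasock headers.

```c
#include <errno.h>
#include <sys/socket.h>

/* Use the exasock header only when it is installed on the build machine. */
#if defined(__has_include)
#  if __has_include(<exasock/socket.h>)
#    include <exasock/socket.h>
#    define HAVE_EXASOCK_SOCKET_H 1
#  endif
#endif

/*
 * Set the exasock-private SO_EXA_NO_ACCEL option on fd.
 *   no_accel != 0: ask exasock not to accelerate this socket.
 *   no_accel == 0: opt in to acceleration (the "exasock --no-auto" case).
 * Call it before bind() or IP_ADD_MEMBERSHIP, because acceleration cannot be
 * disabled once the socket has been accelerated.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_no_accel(int fd, int no_accel)
{
#ifdef HAVE_EXASOCK_SOCKET_H
    /* Assumption: SOL_EXASOCK is the option level and the value is an int. */
    return setsockopt(fd, SOL_EXASOCK, SO_EXA_NO_ACCEL,
                      &no_accel, sizeof(no_accel));
#else
    (void)fd;
    (void)no_accel;
    errno = ENOSYS; /* built without the exasock headers */
    return -1;
#endif
}
```

Checking the return value matters here: when the program is not running under exasock, the kernel does not recognise the option and the call is expected to fail, which the application can treat as "not accelerated anyway".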
Multicast sockets
Versions of exasock older than 2.0.0 automatically accelerate each socket bound to a multicast address. Newer versions (2.0.0 and later) accelerate a multicast socket only if it has joined a multicast group (via the IP_ADD_MEMBERSHIP socket option) with a SmartNIC interface. With exasock 2.0.0 or later, an accelerated socket receives multicast packets only from the interface with which it joined the multicast group.
Warning
exasock 2.0.0 and later:
If a socket bound to a wildcard address (INADDR_ANY) is to be used for receiving multicast traffic, keep in mind that it will always be accelerated. Multicast packets will be discarded on this socket unless it has joined the given multicast group with the IP_ADD_MEMBERSHIP option and the packets arrive through the SmartNIC interface specified in the IP_ADD_MEMBERSHIP configuration.
Warning
exasock 2.0.0 and later:
If a socket is bound to a multicast address but is not associated with a SmartNIC interface through the IP_ADD_MEMBERSHIP option, it will not be accelerated. It will receive multicast packets through the native kernel networking stack instead.
TCP acceleration
exasock 1.7.0 and later include a library called exasock_ext that allows user applications to detect that they are running under exasock and access functionality beyond the standard Linux socket calls. In particular, the current version provides a TCP acceleration feature that allows programmers to achieve even lower TCP latencies than possible with the normal send/sendto/sendmsg APIs.
Using this TCP acceleration feature, an application can construct partial or complete TCP packets ahead of time. These pre-built packets can then be transmitted through the lower level libexanic library, or can even be pushed to a user FPGA application on the SmartNIC card for ultra-low latency responses to triggers. To learn more about this method, see the section on TX preloading.
Since some of the setup for TX preloading needs to know the port number, the following function is provided for convenience when trying to find out which SmartNIC name and port number correspond to a given file descriptor:
int exasock_tcp_get_device(int fd, char *dev, size_t dev_len, int *port_num);
fd is the file descriptor you want to know more about. dev is a buffer in which to put the SmartNIC name, and dev_len is the amount of space available in that buffer. port_num is also a return parameter, and will contain the port number associated with the file descriptor. This function returns 0 on success and -1 on error, in which case errno will be set appropriately.
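A short sketch of how this call might be used is shown below. The header name <exasock/extensions.h>, the 32-byte buffer size and the wrapper name print_tcp_device are assumptions, not details stated on this page; the call itself follows the prototype given above. The __has_include guard only exists so the sketch compiles where the exasock headers are absent.

```c
#include <errno.h>
#include <stdio.h>

/* Assumption: the exasock_ext prototypes live in <exasock/extensions.h>. */
#if defined(__has_include)
#  if __has_include(<exasock/extensions.h>)
#    include <exasock/extensions.h>
#    define HAVE_EXASOCK_EXT 1
#  endif
#endif

/* Print the SmartNIC device name and port behind a TCP socket.
 * Returns 0 on success, -1 with errno set on failure. */
int print_tcp_device(int fd)
{
#ifdef HAVE_EXASOCK_EXT
    char dev[32] = {0}; /* assumed to be large enough for a device name */
    int port_num = -1;

    /* Leave one spare byte so dev stays NUL-terminated whatever is written. */
    if (exasock_tcp_get_device(fd, dev, sizeof(dev) - 1, &port_num) != 0) {
        perror("exasock_tcp_get_device");
        return -1;
    }
    printf("fd %d is on %s port %d\n", fd, dev, port_num);
    return 0;
#else
    (void)fd;
    errno = ENOSYS; /* built without the exasock headers */
    return -1;
#endif
}
```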
The following function constructs a header for the frame you want to send, and inserts it into the buffer provided:
ssize_t exasock_tcp_build_header(int fd, void *buf, size_t len, size_t offset,
int flags);
Note that this builds not only the TCP header, but the IP and Ethernet headers as well. fd is the file descriptor for the connection (e.g. that returned by accept). buf is the buffer to be used for the header data, and len is the length of that buffer. offset and flags are currently unused, and should be set to 0.
The following function sets the length field in the IP header and calculates the IP checksum. Note that this function assumes the IP and TCP headers have no added options, which will be the case for headers generated by exasock_tcp_build_header.
int exasock_tcp_set_length(void *hdr, size_t hdr_len, size_t data_len);
hdr is a pointer to some header bytes, and hdr_len is the number of bytes available after that pointer. data_len is the length of the payload. The return value for this function is normally zero, and -1 if an error has occurred.
The following function calculates the TCP checksum for the header and data pointed to, and places it in the header:
int exasock_tcp_calc_checksum(void *hdr, size_t hdr_len,
const void *data, size_t data_len);
hdr is a pointer to the header bytes of the frame, and hdr_len is the number of bytes valid after this address. data is the TCP payload, and data_len is the length of the payload. The function returns 0 on success, and -1 if an error has occurred.
Warning
Don't modify the frame between exasock_tcp_calc_checksum and exasock_tcp_send_advance.
Warning
If you are using the exasock TCP extension and have prepared a TCP frame for transmission, you must ensure that it is the next frame transmitted. If, for example, another TCP frame is sent via exasock using send(), then the prepared TCP frame will have an incorrect sequence number etc., and so must be discarded.
The following function is intended to be called after your frame has been sent. It will update Exasock's TCP state to account for the packet that was just transmitted by copying the just-sent data to the retransmission buffer, and updating its sequence numbers.
int exasock_tcp_send_advance(int fd, const void *data, size_t data_len);
fd is the file descriptor for the TCP connection, data is the payload that was transmitted, and data_len is the length of the payload.
There is an example tying these concepts together available in src/examples/exasock/tcp-raw-send.c.
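The order of the calls described in this section can be sketched as follows. This is not the tcp-raw-send.c example; it is a simplified outline under several assumptions: the header name <exasock/extensions.h>, the reading of exasock_tcp_build_header's return value as the header length (negative on error), and the hypothetical frame_tx_fn callback that stands in for whatever actually puts the frame on the wire (for example libexanic). The function name send_prebuilt is also illustrative.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Assumption: the exasock_ext prototypes live in <exasock/extensions.h>. */
#if defined(__has_include)
#  if __has_include(<exasock/extensions.h>)
#    include <exasock/extensions.h>
#    define HAVE_EXASOCK_EXT 1
#  endif
#endif

/* Hypothetical transmit hook, not part of exasock. Returns 0 on success. */
typedef int (*frame_tx_fn)(void *ctx, const void *frame, size_t frame_len);

/* Build, checksum, transmit and account for one TCP segment.
 * Returns 0 on success, -1 with errno set on failure. */
int send_prebuilt(int fd, const void *payload, size_t payload_len,
                  unsigned char *frame, size_t frame_cap,
                  frame_tx_fn tx, void *tx_ctx)
{
    if (payload == NULL || frame == NULL || tx == NULL) {
        errno = EINVAL;
        return -1;
    }
#ifdef HAVE_EXASOCK_EXT
    /* 1. Ethernet, IP and TCP headers go at the start of the frame.
     *    Assumption: the return value is the header length in bytes. */
    ssize_t hdr_len = exasock_tcp_build_header(fd, frame, frame_cap, 0, 0);
    if (hdr_len < 0)
        return -1;
    if ((size_t)hdr_len > frame_cap ||
        payload_len > frame_cap - (size_t)hdr_len) {
        errno = EMSGSIZE;
        return -1;
    }

    /* 2. The payload follows the headers. */
    memcpy(frame + hdr_len, payload, payload_len);

    /* 3. IP length and checksum first, then the TCP checksum. */
    if (exasock_tcp_set_length(frame, (size_t)hdr_len, payload_len) != 0)
        return -1;
    if (exasock_tcp_calc_checksum(frame, (size_t)hdr_len,
                                  payload, payload_len) != 0)
        return -1;

    /* 4. Transmit the frame unmodified. */
    if (tx(tx_ctx, frame, (size_t)hdr_len + payload_len) != 0)
        return -1;

    /* 5. Let exasock update its sequence numbers and retransmit buffer. */
    return exasock_tcp_send_advance(fd, payload, payload_len);
#else
    (void)fd; (void)payload_len; (void)frame_cap; (void)tx_ctx;
    errno = ENOSYS; /* built without the exasock headers */
    return -1;
#endif
}
```

In a latency-sensitive design, steps 1 to 3 would normally run ahead of time and only steps 4 and 5 at the moment of the trigger; they are shown together here to make the ordering visible. No other send() on the same connection may happen between step 1 and step 5, for the reason given in the warning above.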
Frame warming
Calling send, sendto or sendmsg under Exasock with the MSG_EXA_WARM flag present in the flags argument will result in the message being aborted as close as possible to the end of the TX code path. This is intended to warm the cache.
Note that this flag was introduced in Exasock 2.2.0. Using this flag in prior versions will cause the frame to be sent as normal. As such, you should only use this flag with Exasock 2.2.0 or later. You can check the Exasock version in your program:
if (exasock_version_code() < EXASOCK_VERSION(2,2,0)) {
// do something else
}
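Putting the version check and the flag together, a warming call might look like the sketch below. The header locations (<exasock/socket.h> for MSG_EXA_WARM, <exasock/extensions.h> for exasock_version_code() and EXASOCK_VERSION) and the wrapper name warm_tx_path are assumptions; so is the expectation that the version check also fails safely when the program is not running under exasock. The guard lets the sketch compile without the exasock headers, in which case it does nothing.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Assumption: these two headers provide the flag and the version macros. */
#if defined(__has_include)
#  if __has_include(<exasock/socket.h>) && __has_include(<exasock/extensions.h>)
#    include <exasock/socket.h>
#    include <exasock/extensions.h>
#    define HAVE_EXASOCK 1
#  endif
#endif

/*
 * Exercise the TX code path for fd without putting a frame on the wire.
 * Returns the result of send() when warming was attempted, or 0 when it
 * was skipped because the flag is not supported.
 */
ssize_t warm_tx_path(int fd, const void *buf, size_t len)
{
#ifdef HAVE_EXASOCK
    /* Before 2.2.0 the flag is ignored and the frame would really be sent,
     * so skip warming rather than transmit dummy data. */
    if (exasock_version_code() < EXASOCK_VERSION(2, 2, 0))
        return 0;
    return send(fd, buf, len, MSG_EXA_WARM);
#else
    (void)fd; (void)buf; (void)len;
    return 0; /* built without the exasock headers: nothing to warm */
#endif
}
```

A typical pattern would be to call such a helper periodically while the connection is idle, using a buffer of the same size as the real messages.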
Known issues and limitations
- Each thread that calls a blocking I/O call (e.g. select(), poll(), epoll_wait(), recv(), read() or accept()) will spin waiting on data. This normally provides optimal latency but can induce performance problems if there are more threads than available CPUs. Other blocking modes will be provided in the future.
- If a socket is bound to a wildcard address (INADDR_ANY), it will only receive packets that arrive on SmartNIC interfaces when run with the acceleration library.
- If an exasock version older than 2.0.0 is used and a socket is bound to a multicast address, or exasock version 2.0.0 or newer is used and a socket has joined a multicast group (IP_ADD_MEMBERSHIP socket option) with a SmartNIC interface, it will only receive packets that arrive on SmartNIC interfaces when run with the acceleration library.
- Connecting to an accelerated socket from the same host is not supported (for example, if a socket is bound to 192.168.1.1:80, then it is not possible to connect to 192.168.1.1:80 from the local host).
- Transmitted multicast datagrams are not looped back to local sockets.
- The MSG_WAITALL flag to recv() is not currently supported (to be resolved).
- No support for recursive addition of epoll file descriptors to epoll sets.
- No support for IP fragmentation.
- Sockets may not be correctly maintained across fork() or execve().
- Sockets cannot be transferred to other processes with sendmsg().
- recvmmsg() duplicates the Linux behavior of only checking the timeout after the receipt of each datagram, so that if up to vlen-1 datagrams are received before the timeout expires, but then no further datagrams are received, the call will block forever.
Tips for best performance
- Wherever possible, do not mix accelerated sockets with non-accelerated sockets and other file descriptors in select() and poll() calls.
- For the best possible performance, pin threads to CPU cores in the CPU socket directly connected to the SmartNIC.
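Thread pinning can be done from inside the application with the standard Linux affinity call, as in the sketch below. The function name pin_current_thread is illustrative, and choosing which core to use is left to the caller; on Linux the cores local to a PCI device are usually listed in the device's local_cpulist attribute in sysfs, though that should be checked on the target system.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for cpu_set_t and sched_setaffinity() */
#endif
#include <errno.h>
#include <sched.h>

/* Restrict the calling thread to a single CPU core.
 * Returns 0 on success, -1 with errno set on failure. */
int pin_current_thread(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        errno = EINVAL;
        return -1;
    }

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Because each thread blocked in an accelerated call spins on its core, giving every such thread its own pinned core avoids the oversubscription problem mentioned under known issues.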
This page was last updated on Apr-06-2021.