
Capture Ethernet frames using an AF_PACKET ring buffer in C

Tested on

Ubuntu (Lucid, Trusty, Xenial)

Objective

To capture all frames received by a given Ethernet interface using an AF_PACKET socket with an associated memory-mapped ring buffer

Background

The purpose of an AF_PACKET socket is to allow network communication at the link layer, for example to receive or transmit raw Ethernet frames. For basic usage see the microHOWTO Capture Ethernet frames using an AF_PACKET socket in C.

It is possible to receive traffic from an AF_PACKET socket using the same system calls as for any other type of socket; however, this can be inefficient due to the need for at least one system call per frame received. Ring buffers are areas of memory which allow frames to be passed from kernel to user space without the use of system calls. This allows for higher throughput and a reduced risk of packet loss.

The ring buffer API has progressed through several versions:

  * TPACKET_V1, the original version and the default;
  * TPACKET_V2, which among other changes reports timestamps with nanosecond resolution; and
  * TPACKET_V3, which delivers frames grouped into blocks.

These are not backwards-compatible, so the API selected must exactly match the one that the code was written for.

Scenario

Suppose you wish to capture all frames received by all interfaces using the TPACKET_V1 ring buffer API.

(Use of TPACKET_V2 is considered as a variation below. Use of TPACKET_V3 will be the subject of a future microHOWTO. Capture of specific LinkTypes, or from specific interfaces, is described in the microHOWTO referenced above.)

Method

Overview

The method described here has eight steps:

  1. Create the AF_PACKET socket.
  2. Select the ring buffer API version required.
  3. Create the ring buffer.
  4. Map the ring buffer into memory.
  5. Repeatedly iterate through the ring buffer.
  6. Wait for a frame to become ready for processing.
  7. Handle the received frame.
  8. Return ownership of the frame buffer to the kernel.

The following header files are used:

Header Used by
<stdlib.h> exit
<stdio.h> perror
<stdbool.h> true
<unistd.h> sysconf, _SC_PAGESIZE, _SC_PHYS_PAGES
<poll.h> struct pollfd, poll, POLLIN
<arpa/inet.h> htons
<net/ethernet.h> ETH_P_ALL, ETH_HLEN
<linux/if_packet.h> struct sockaddr_ll, struct tpacket_req, struct tpacket*_hdr, TPACKET_V*, PACKET_VERSION, PACKET_RX_RING, TPACKET_ALIGN, TPACKET*_HDRLEN, TP_STATUS_*
<sys/socket.h> AF_PACKET, SOCK_RAW, SOCK_DGRAM, socket, setsockopt, SOL_PACKET
<sys/mman.h> mmap, PROT_*, MAP_SHARED, MAP_FAILED

Older versions of the packet(7) manpage specify inclusion of the header <netpacket/packet.h>, however this has since been changed to <linux/if_packet.h>. The latter has historically been kept more up to date than the former, and is the better choice under most circumstances.

AF_PACKET sockets are specific to Linux. Programs that make use of them need elevated privileges (specifically, the CAP_NET_RAW capability, most easily obtained by running as root) in order to run.

Create the AF_PACKET socket

Use of a ring buffer does not make any difference to how the AF_PACKET socket is created:

int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (fd == -1) {
    perror("socket");
    exit(1);
}

Select the ring buffer API version required

This step has been included here for completeness, however it should be omitted when TPACKET_V1 is required, since that version is the default. Explicitly selecting TPACKET_V1 does no harm on kernels which allow this to be done, however it will prevent the program from compiling or (depending on how errors are handled) running in environments which pre-date TPACKET_V2.

The API version is selected using a socket option named PACKET_VERSION at level SOL_PACKET. It takes an argument of type int which contains one of the constants TPACKET_V1, TPACKET_V2 or TPACKET_V3:

int version = TPACKET_V1;
if ((setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version))) == -1) {
    perror("setsockopt");
    exit(1);
}

If the requested API version is unavailable then setsockopt will return an error. You could then try again with a lower version number, provided that the program has support for the APIs in question (which, as noted above, are not backwards-compatible). Failing that, you could fall back further to using recvfrom or recvmsg.
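For example, a program which preferred TPACKET_V2 but was prepared to fall back to TPACKET_V1 might proceed as in the following sketch (the remainder of this microHOWTO assumes TPACKET_V1, and a real program would need to use whichever header structures match the version actually selected):

int version = TPACKET_V2;
if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version)) == -1) {
    // TPACKET_V2 unavailable: fall back to TPACKET_V1, which is the default
    // and therefore needs no explicit selection.
    version = TPACKET_V1;
}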

Create the ring buffer

A ring buffer is composed of a number of blocks, each of which is a contiguous region of physical memory. Each block is divided into a number of frames, each of which holds one captured Ethernet frame.

The ring buffer is created using the PACKET_RX_RING socket option at the SOL_PACKET level. It takes as its argument a structure of type struct tpacket_req, which has four fields. These define the geometry of the ring buffer in terms of what are called ‘blocks’ and ‘frames’:

  * tp_block_size, the size of each block in bytes;
  * tp_block_nr, the number of blocks;
  * tp_frame_size, the size of each frame in bytes; and
  * tp_frame_nr, the total number of frames in the ring.

The frame size is determined by the maximum frame size you wish to capture. It is necessary to allow additional space for header data at the start of each frame which contains a tpacket_hdr or equivalent, a sockaddr_ll, and two areas of padding needed for alignment. For TPACKET_V1, the required frame size is:

struct tpacket_req req = {0};
req.tp_frame_size = TPACKET_ALIGN(TPACKET_HDRLEN + ETH_HLEN) + TPACKET_ALIGN(snaplen);

where ETH_HLEN is equal to 14 if you are capturing Ethernet headers, or omitted from the calculation if you are not, and snaplen is the maximum number of bytes of network-layer data to be captured (so 1500 for standard Ethernet). For other versions of the API replace TPACKET_HDRLEN with TPACKET2_HDRLEN or TPACKET3_HDRLEN as appropriate.
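For example, if the link-layer header were not being captured (as would be the case for a packet socket of type SOCK_DGRAM) and snaplen were the standard 1500 bytes, the calculation would reduce to the following sketch:

size_t snaplen = 1500; /* maximum bytes of network-layer data to capture */
req.tp_frame_size = TPACKET_ALIGN(TPACKET_HDRLEN) + TPACKET_ALIGN(snaplen);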

The block size must not be smaller than the frame size. It should also be a power-of-two multiple of the page size for the system, as reported by sysconf(_SC_PAGESIZE) or getpagesize(), as any excess would be wasted. However, if it is made too large then this could exceed architectural limits, or conceivably it might not be possible to arrange for that number of contiguous pages to be made available. A reasonable policy would be to choose the smallest power-of-two multiple of the page size which is not less than the frame size, or perhaps a small multiple of the frame size if packing density is a concern:

req.tp_block_size = sysconf(_SC_PAGESIZE);
while (req.tp_block_size < req.tp_frame_size) {
    req.tp_block_size <<= 1;
}

The number of blocks can be chosen freely, subject to physical and architectural limits, and the extent to which you can afford to deprive other processes of access to physical memory. Note that since the memory reserved cannot be swapped, the impact of a large allocation on system performance could be much greater than an equivalent amount of virtual memory. However, you are not necessarily limited to the number of physical pages which are free at the time, since existing users could be evicted by the allocation.

For some applications it may make sense to ask not how much physical memory is needed, but rather how much is available. For example, if you wished to avoid packet loss without much regard for any other processes running on the machine then it might be reasonable to allocate half of the available physical memory:

req.tp_block_nr = sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE) / (2 * req.tp_block_size);

Finally, tp_frame_nr must be set to the highest possible value that is consistent with the other three parameters, allowing for the constraint that frames cannot cross block boundaries (so the number of frames per block must be an integer):

size_t frames_per_buffer = req.tp_block_size / req.tp_frame_size;
req.tp_frame_nr = req.tp_block_nr * frames_per_buffer;

Once all four fields of the request structure have been populated, it can be passed to setsockopt:

if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req))==-1) {
    perror("setsockopt");
    exit(1);
}

Map the ring buffer into memory

The socket option above causes the ring buffer to be allocated in physical memory, but this does not by itself make it accessible to the calling process. For that it must be mapped into the virtual address space of the process. This can be done using mmap in the same manner as a regular file:

size_t rx_ring_size = req.tp_block_nr * req.tp_block_size;
char* rx_ring = mmap(0, rx_ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (rx_ring == MAP_FAILED) {
    perror("mmap");
    exit(1);
}

The combination of PROT_READ and PROT_WRITE requests that both read and write access to the buffer be granted. This is necessary even though only an RX ring is being created in this example (with no TX ring), because the receiving process must be able to write to the buffer to inform the kernel when frames can be reused.

The mapping must be performed using MAP_SHARED to ensure that writes by the process propagate to the kernel and vice-versa.

Note that the ring buffer will be presented by mmap as a contiguous region of virtual memory, even though it is most likely not contiguous in physical memory. This does not mean that the process can be entirely oblivious to the fact that the buffer is composed of blocks, however it does transform what would otherwise have been an irregular structure into a regular one.
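The capture loop described below runs indefinitely, but if and when capture is finished, the mapping and the socket can be released in the usual way. A minimal sketch, assuming the variables defined above (munmap is declared in <sys/mman.h> and close in <unistd.h>):

if (munmap(rx_ring, rx_ring_size) == -1) {
    perror("munmap");
}
close(fd);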

Repeatedly iterate through the ring buffer

As the name suggests, the frames within the ring buffer are accessed as if they were arranged in a circle with the first frame being considered adjacent to the last. Both the kernel and the user process advance forwards around the ring, starting at the first frame. The user process must keep track of the frame it has reached within the ring. Since it is possible that the frames are not evenly spaced, it is helpful to have both an index and a pointer for this purpose:

size_t frame_idx = 0;
char* frame_ptr = rx_ring;

At the end of each iteration the index should be incremented modulo the number of frames, and the pointer recalculated to match the index. For clarity, this recalculation has been broken down into multiple steps:

while (true) {

    /* (handle frame) */

    // Increment frame index, wrapping around if end of buffer is reached.
    frame_idx = (frame_idx + 1) % req.tp_frame_nr;

    // Determine the location of the buffer which the next frame lies within.
    int buffer_idx = frame_idx / frames_per_buffer;
    char* buffer_ptr = rx_ring + buffer_idx * req.tp_block_size;

    // Determine the location of the frame within that buffer.
    int frame_idx_diff = frame_idx % frames_per_buffer;
    frame_ptr = buffer_ptr + frame_idx_diff * req.tp_frame_size;
}

Wait for a frame to become ready for processing

Each frame begins with a tpacket_hdr structure, which contains a flag for indicating whether the frame is ready for processing. The flag is located in the tp_status field, and can be tested using a bitmask named TP_STATUS_USER. While the bit is clear, the frame is considered to be owned by the kernel and should not be accessed by the user process. While it is set, it is considered to be owned by the user process and will not be changed by the kernel.

It would be possible to busy-wait on this flag; however, that would obviously be inefficient. The alternative is to use a system call such as poll to block execution until activity is seen on the file descriptor; however, you would not want to make a system call for every packet captured, as that would defeat the point of using a ring buffer. The course of action recommended here is therefore to test the status of the frame first using the flag, then block if and only if the frame is not yet ready:

struct pollfd fds[1] = {0};
fds[0].fd = fd;
fds[0].events = POLLIN;

struct tpacket_hdr* tphdr = (struct tpacket_hdr*)frame_ptr;
while (!(tphdr->tp_status & TP_STATUS_USER)) {
    if (poll(fds, 1, -1) == -1) {
        perror("poll");
        exit(1);
    }
}

This allows frames to be processed with minimal hindrance under heavy load, which is when performance matters, whilst avoiding unnecessary load when waiting for traffic.

The poll function used above takes three arguments:

  * an array of struct pollfd structures, each identifying a file descriptor to be monitored and the events of interest for it;
  * the number of entries in that array; and
  * a timeout in milliseconds, where a negative value means wait indefinitely.

In this instance there is only one file descriptor to monitor, and the only event listed as being of interest is POLLIN, which indicates that there is data available to be read.
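The method above passes a timeout of -1, so poll blocks indefinitely. If you instead wanted to wake periodically even when no traffic arrives (for example, to check whether the program should shut down), a finite timeout in milliseconds could be given, in which case a return value of 0 indicates that the timeout expired. A sketch only, not part of the method above:

// Wait at most one second for a frame to become ready.
int ready = poll(fds, 1, 1000);
if (ready == -1) {
    perror("poll");
    exit(1);
} else if (ready == 0) {
    // Timed out without any frame becoming ready.
}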

Handle the received frame

Each frame has three components within it:

  * a tpacket_hdr (or equivalent, depending on the API version), containing the status flag and metadata for the captured frame;
  * a sockaddr_ll, identifying the interface and link-layer address from which the frame was received; and
  * the content of the captured frame itself.

There may be padding between these components, which must be allowed for when calculating the pointer offsets. To assist with this, the tpacket_hdr structure contains offsets for the link-layer header (if it was requested) and the network-layer packet content in the tp_mac and tp_net fields respectively:

struct sockaddr_ll* addr = (struct sockaddr_ll*)(frame_ptr + TPACKET_HDRLEN - sizeof(struct sockaddr_ll));
char* l2content = frame_ptr + tphdr->tp_mac;
char* l3content = frame_ptr + tphdr->tp_net;
handle_frame(tphdr, addr, l2content, l3content);

The link-layer and network-layer headers are adjacent to each other, so it is not necessary to use tp_net to access the network layer: you can instead use tp_mac to access the complete frame. The documentation does not appear to specify how tp_mac is set in the case where capture of the link-layer header was not requested, however current behaviour is for it to be set equal to tp_net.

In addition to the frame content itself, notable metadata which can be captured includes:

  * the length of the frame as received on the wire (tp_len) and the number of bytes actually captured (tp_snaplen);
  * a timestamp indicating when the frame was captured (tp_sec and tp_usec for TPACKET_V1);
  * the index of the interface on which the frame was received (sll_ifindex in the sockaddr_ll structure); and
  * the link-layer protocol, packet type and source address (sll_protocol, sll_pkttype, sll_halen and sll_addr).
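For illustration, here is a minimal sketch of a handle_frame implementation which prints some of this metadata. The function is assumed throughout this microHOWTO to be supplied by the application, and the signature below is merely inferred from the call above:

void handle_frame(struct tpacket_hdr* tphdr, struct sockaddr_ll* addr,
    char* l2content, char* l3content) {
    // tp_len is the length on the wire; tp_snaplen is the number of bytes captured.
    printf("%u.%06u: ifindex=%d captured %u of %u bytes\n",
        tphdr->tp_sec, tphdr->tp_usec,
        addr->sll_ifindex, tphdr->tp_snaplen, tphdr->tp_len);
}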

Return ownership of the frame buffer to the kernel

In order to avoid race conditions, the kernel will not overwrite the frame buffer with new content unless and until control of the buffer has been explicitly released by the user process. This is done by setting the status field to TP_STATUS_KERNEL:

tphdr->tp_status = TP_STATUS_KERNEL;

(As noted previously it is the TP_STATUS_USER flag which specifically controls ownership, however the API documentation explicitly specifies that the above assignment should be performed, and this has the effect of clearing all of the flags.)

Variations

Using TPACKET_V2

In order to use TPACKET_V2, the following changes must be made:

  * the API version must be selected explicitly using the PACKET_VERSION socket option described above (TPACKET_V1 being the default);
  * struct tpacket2_hdr must be used in place of struct tpacket_hdr; and
  * TPACKET2_HDRLEN must be used in place of TPACKET_HDRLEN.

(These are not the only changes, but they are the ones which have a bearing on the method described above.)
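As a sketch of how this affects the fragments given earlier (assuming an otherwise identical program):

int version = TPACKET_V2;
if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version)) == -1) {
    perror("setsockopt");
    exit(1);
}

req.tp_frame_size = TPACKET_ALIGN(TPACKET2_HDRLEN + ETH_HLEN) + TPACKET_ALIGN(snaplen);

struct tpacket2_hdr* tphdr = (struct tpacket2_hdr*)frame_ptr;
struct sockaddr_ll* addr = (struct sockaddr_ll*)(frame_ptr + TPACKET2_HDRLEN - sizeof(struct sockaddr_ll));

Note that in struct tpacket2_hdr the timestamp is reported in seconds and nanoseconds (tp_sec and tp_nsec) rather than seconds and microseconds.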

Sample code

/* The fragments above assembled into a single program.  Note that snaplen
 * and handle_frame are not defined by the method itself: snaplen is set
 * here to 1500 bytes (standard Ethernet), and handle_frame must be
 * supplied by the application (see the sketch above). */

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>
#include <poll.h>
#include <arpa/inet.h>
#include <net/ethernet.h>
#include <linux/if_packet.h>
#include <sys/socket.h>
#include <sys/mman.h>

void handle_frame(struct tpacket_hdr* tphdr, struct sockaddr_ll* addr,
    char* l2content, char* l3content);

int main(void) {
    size_t snaplen = 1500;

    /* Create the AF_PACKET socket. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd == -1) {
        perror("socket");
        exit(1);
    }

    /* Create the ring buffer. */
    struct tpacket_req req = {0};
    req.tp_frame_size = TPACKET_ALIGN(TPACKET_HDRLEN + ETH_HLEN) + TPACKET_ALIGN(snaplen);
    req.tp_block_size = sysconf(_SC_PAGESIZE);
    while (req.tp_block_size < req.tp_frame_size) {
        req.tp_block_size <<= 1;
    }
    req.tp_block_nr = sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE) / (2 * req.tp_block_size);
    size_t frames_per_buffer = req.tp_block_size / req.tp_frame_size;
    req.tp_frame_nr = req.tp_block_nr * frames_per_buffer;
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) == -1) {
        perror("setsockopt");
        exit(1);
    }

    /* Map the ring buffer into memory. */
    size_t rx_ring_size = req.tp_block_nr * req.tp_block_size;
    char* rx_ring = mmap(0, rx_ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (rx_ring == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    /* Repeatedly iterate through the ring buffer. */
    struct pollfd fds[1] = {0};
    fds[0].fd = fd;
    fds[0].events = POLLIN;
    size_t frame_idx = 0;
    char* frame_ptr = rx_ring;

    while (true) {
        /* Wait for a frame to become ready for processing. */
        struct tpacket_hdr* tphdr = (struct tpacket_hdr*)frame_ptr;
        while (!(tphdr->tp_status & TP_STATUS_USER)) {
            if (poll(fds, 1, -1) == -1) {
                perror("poll");
                exit(1);
            }
        }

        /* Handle the received frame. */
        struct sockaddr_ll* addr = (struct sockaddr_ll*)(frame_ptr + TPACKET_HDRLEN - sizeof(struct sockaddr_ll));
        char* l2content = frame_ptr + tphdr->tp_mac;
        char* l3content = frame_ptr + tphdr->tp_net;
        handle_frame(tphdr, addr, l2content, l3content);

        /* Return ownership of the frame buffer to the kernel. */
        tphdr->tp_status = TP_STATUS_KERNEL;

        /* Advance to the next frame in the ring. */
        frame_idx = (frame_idx + 1) % req.tp_frame_nr;
        int buffer_idx = frame_idx / frames_per_buffer;
        char* buffer_ptr = rx_ring + buffer_idx * req.tp_block_size;
        int frame_idx_diff = frame_idx % frames_per_buffer;
        frame_ptr = buffer_ptr + frame_idx_diff * req.tp_frame_size;
    }
}

Further reading

  * packet(7) manpage
  * Documentation/networking/packet_mmap.txt (Linux kernel source)

Tags: c | socket