Reputation: 11002
I have read the packet(7) man page and a few blog posts trying to understand how to use the PACKET_FANOUT socket option to scale the processing of received data (I am looking to use SOCK_RAW to capture traffic at high speeds, >10 Gbps). I have read through this example code (copied below) but I'm not sure I have fully understood it.
Let's imagine a scenario: RSS has been set up on the NIC and ingress traffic is evenly distributed between RX queues; there is an 8-core CPU and 8 NIC RX queues, and each RX queue [0-7] sends an interrupt to CPU [0-7] respectively (further discussion about MMAP, zero copy, poll() et al. is off topic here).
This is the order of events as I see it in the example code:
1. setup_socket() is called by each worker thread; the resulting sockets are all bound to the same physical NIC, in promisc mode, and all part of the same FANOUT group.
…
4. A read() call is made to that socket (so socket 0 created by thread 0 only) and the data is then copied to the userland receive buffer for that thread only, because of this flag.

Point number 4 is the main point of doubt in my understanding of this process. Have I understood correctly how scaling works with PACKET_FANOUT in this scenario, and how we lock a worker thread to the same core that processes the interrupt?
void start_af_packet_capture(std::string interface_name, int fanout_group_id) {
    // setup_socket() calls socket() (using SOCK_RAW) to create the socket FD,
    // setsockopt() to enable promisc mode on the NIC,
    // bind() to bind the socket FD to the NIC,
    // and setsockopt() again to set PACKET_FANOUT + PACKET_FANOUT_CPU
    int packet_socket = setup_socket(interface_name, fanout_group_id);

    if (packet_socket == -1) {
        printf("Can't create socket\n");
        return;
    }

    const unsigned int capture_length = 1500;
    char buffer[capture_length];

    while (true) {
        int readed_bytes = read(packet_socket, buffer, capture_length);

        // printf("Got %d bytes from interface\n", readed_bytes);

        if (readed_bytes < 0) {
            break;
        }

        received_packets++;
        consume_pkt((u_char*)buffer, readed_bytes);
    }
}
...
bool use_multiple_fanout_processes = true;
// Could get some speed up on NUMA servers
bool execute_strict_cpu_affinity = false;
int main() {
    boost::thread speed_printer_thread( speed_printer );

    int fanout_group_id = getpid() & 0xffff;

    if (use_multiple_fanout_processes) {
        boost::thread_group packet_receiver_thread_group;

        unsigned int num_cpus = 8;
        for (unsigned int cpu = 0; cpu < num_cpus; cpu++) {
            boost::thread::attributes thread_attrs;

            if (execute_strict_cpu_affinity) {
                cpu_set_t current_cpu_set;

                int cpu_to_bind = cpu % num_cpus;
                CPU_ZERO(&current_cpu_set);
                // We count CPUs from zero
                CPU_SET(cpu_to_bind, &current_cpu_set);

                int set_affinity_result = pthread_attr_setaffinity_np(thread_attrs.native_handle(), sizeof(cpu_set_t), &current_cpu_set);

                if (set_affinity_result != 0) {
                    printf("Can't set CPU affinity for thread\n");
                }
            }

            packet_receiver_thread_group.add_thread(
                new boost::thread(thread_attrs, boost::bind(start_af_packet_capture, "eth6", fanout_group_id))
            );
        }

        // Wait for all threads to finish
        packet_receiver_thread_group.join_all();
    } else {
        start_af_packet_capture("eth6", 0);
    }

    speed_printer_thread.join();
}
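For completeness, here is a minimal sketch of what setup_socket() does, reconstructed from the comments above (the structure follows packet(7); exact names and error handling are illustrative, this is not the original helper):

// Illustrative sketch only -- not the original setup_socket() implementation.
#include <cstring>
#include <string>
#include <unistd.h>
#include <net/if.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

int setup_socket(const std::string& interface_name, int fanout_group_id) {
    // Raw AF_PACKET socket that sees every protocol
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd == -1) return -1;

    int ifindex = if_nametoindex(interface_name.c_str());
    if (ifindex == 0) { close(fd); return -1; }

    // Enable promiscuous mode on the NIC
    struct packet_mreq mreq;
    memset(&mreq, 0, sizeof(mreq));
    mreq.mr_ifindex = ifindex;
    mreq.mr_type = PACKET_MR_PROMISC;
    if (setsockopt(fd, SOL_PACKET, PACKET_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) == -1) {
        close(fd);
        return -1;
    }

    // Bind the socket to the interface
    struct sockaddr_ll bind_addr;
    memset(&bind_addr, 0, sizeof(bind_addr));
    bind_addr.sll_family = AF_PACKET;
    bind_addr.sll_protocol = htons(ETH_P_ALL);
    bind_addr.sll_ifindex = ifindex;
    if (bind(fd, (struct sockaddr*)&bind_addr, sizeof(bind_addr)) == -1) {
        close(fd);
        return -1;
    }

    // Join the fanout group; the upper 16 bits of the argument carry the fanout mode
    int fanout_arg = (fanout_group_id & 0xffff) | (PACKET_FANOUT_CPU << 16);
    if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &fanout_arg, sizeof(fanout_arg)) == -1) {
        close(fd);
        return -1;
    }

    return fd;
}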
Edit: Bonus Question
This might be too unrelated, in which case please advise and I will start a separate SO post. The aim here is not just to scale packet processing across multiple cores but also to place the packet processing code on the same core that receives the packet (later MMAP & RX_RING will be explored), so that there are fewer context switches and cache misses on the CPU. My understanding is that this goal is being achieved here; can someone please confirm or deny?
Upvotes: 5
Views: 4069
Reputation: 39075
PACKET_FANOUT_CPU works differently. Each socket that joins a fanout group is pushed to the back of a group-specific array. The kernel then selects a socket from that array by a simple receive-queue-CPU mapping function, similar to this pseudo-code:
fanout_group_array[cpu_id(rx_queue_X_handler) % size(fanout_group_array)]
Thus, for a receive-queue-CPU to application-thread-CPU mapping, you need to be careful that the pinned application threads join the fanout group sequentially, deterministically, and in order, i.e. the thread pinned to CPU 0 joins first, then the thread pinned to CPU 1, and so on.
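To illustrate, here is a minimal sketch of such an ordered join. It assumes setup_socket() is the helper from the question, that one socket is created per CPU, and that capture_loop() is a hypothetical per-socket read loop; the main thread joins the sockets to the group in CPU order before handing them to pinned worker threads, so slot i of the fanout array corresponds to CPU i:

// Sketch only: sockets are created (and thus join the fanout group) strictly
// in CPU order from the main thread, then each is handed to a thread pinned
// to the matching CPU.
#include <pthread.h>
#include <sched.h>
#include <string>
#include <vector>
#include <boost/thread.hpp>

int setup_socket(const std::string& interface_name, int fanout_group_id);  // the question's helper
void capture_loop(int packet_socket);                                      // hypothetical per-socket read loop

void start_ordered_fanout(const std::string& interface_name, int fanout_group_id, unsigned int num_cpus) {
    std::vector<int> sockets;

    // Join the fanout group sequentially: socket for CPU 0 first, then CPU 1, ...
    for (unsigned int cpu = 0; cpu < num_cpus; cpu++) {
        sockets.push_back(setup_socket(interface_name, fanout_group_id));
    }

    boost::thread_group workers;
    for (unsigned int cpu = 0; cpu < num_cpus; cpu++) {
        boost::thread* t = new boost::thread(capture_loop, sockets[cpu]);

        // Pin the worker that reads fanout slot `cpu` to CPU `cpu`
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(cpu, &cpuset);
        pthread_setaffinity_np(t->native_handle(), sizeof(cpu_set_t), &cpuset);

        workers.add_thread(t);
    }

    workers.join_all();
}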
To answer your bonus question: yes, pinning a packet-processing application thread to the CPU [1] where the packet was received increases the probability that it is still in the non-shared, higher-level (faster) CPU cache.
So this is advantageous for an application thread.
But the higher the packet rate, the more often the pinned application thread gets interrupted, so that thread's throughput might be quite unstable - even up to the point of starvation (i.e. the application thread only gets a tiny slice of that CPU -> effectively zero throughput).
Thus, another strategy for high packet rates is to pin your application threads to cores that no receive queue is assigned to.
Depending on your multi-core processor and pinning, you can still benefit from CPU caching, i.e. when both cores share a possibly very large L3 cache.
IOW, cache locality isn't black-and-white.
If you have enough CPU cores, you can have an M:N rx-queue-CPU to application-CPU mapping where M and N are the same size but disjoint. But perhaps your application threads don't need as many CPU cycles, in which case you only need M < N for disjoint M:N.
Context switches happen if there are multiple runnable tasks on a CPU. You can avoid that by isolating the CPUs that you have pinned your threads to. When your application thread is interrupted by an rx-queue interrupt (and possibly some kernel processing, e.g. in that CPU's kworker thread), it isn't a context switch but a mode transition, which is less expensive than a context switch. IOW, you don't influence the number of context switches by having a 1:1 mapping.
[1] To avoid confusion: in this context we use CPU == CPU core == core. A processor can have one or many CPU cores.
Upvotes: 0
Reputation: 173
Since I don't have 50+ reputation, I cannot comment; I'll leave a reply here instead. @Jim D's answer is correct. I will add that (at least with recent Linux kernel versions, 4.X+) sockets are added to the array of sockets in the fanout group in the order you add them. The 0th socket to be added will be in position 0, the 1st in position 1, and so forth. This means that if you have an interrupt from your NIC pinned to CPU 0 and you want it to be handled by an application thread you have also pinned to CPU 0 (and likewise for CPU 1), the CPU fanout algorithm will work for you. If you have a NIC that supports DDIO, you'll be more likely to get a cache hit this way too when reading the frame.
However, it's worth noting that if you remove a socket from the fanout group, it will be removed from the array and the last socket in the array will be swapped into its place (i.e. removing the socket in position i --> sock_arr[i] = sock_arr[num_sockets_in_fanout_group - 1]; num_sockets_in_fanout_group--;). So the order is deterministic, but keep in mind that the order can change if you are adding and removing sockets from the group dynamically.
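To make the swap-removal concrete, here is a tiny userspace model of that array bookkeeping (just an illustration, not kernel code):

#include <cstdio>

// Toy model of the fanout group's socket array: removal swaps the last
// entry into the freed slot, so later slots can change position.
int sock_arr[8] = {100, 101, 102, 103};  // pretend these are 4 socket ids
int num_sockets_in_fanout_group = 4;

void remove_socket_at(int i) {
    sock_arr[i] = sock_arr[num_sockets_in_fanout_group - 1];
    num_sockets_in_fanout_group--;
}

int main() {
    remove_socket_at(1);  // remove the socket in slot 1
    // Slot 1 now holds what used to be the last socket (103), not 102:
    for (int i = 0; i < num_sockets_in_fanout_group; i++)
        printf("slot %d -> socket %d\n", i, sock_arr[i]);
}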
One other thing I would like to add is that if one's goal is to scale across multiple cores to increase throughput, it is worth evaluating whether or not you should skip CPU 0 and start with CPU 1 instead. CPU 0 is where OS tasks and others that are not "affinitized" to a particular CPU core typically run. Similarly, it would be worth ensuring no other interrupts or tasks are pinned to your application's cores.
Upvotes: 4
Reputation: 2403
As best I can tell, no, not quite. fanout_demux_cpu calculates a "hash" using the CPU and the number of sockets in the fanout group, which happens to be smp_processor_id() % num. packet_rcv_fanout then uses this as an index into the array of sockets in the fanout group to determine which socket gets it.
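In other words (paraphrasing net/packet/af_packet.c, simplified and subject to change between kernel versions), the per-packet selection behaves roughly like this toy model, where cpu stands in for smp_processor_id() and the result indexes the group's socket array:

#include <cstdio>

// Toy model of PACKET_FANOUT_CPU selection: the chosen array slot depends only
// on the CPU that is processing the packet and the number of sockets in the
// group, not on which thread is blocked in read() on which socket.
int fanout_demux_cpu_model(int cpu, int num_sockets) {
    return cpu % num_sockets;  // kernel: smp_processor_id() % num
}

int main() {
    const int num_sockets = 8;  // 8 sockets joined the fanout group
    for (int cpu = 0; cpu < 8; cpu++) {
        // A packet whose rx-queue interrupt is handled on `cpu` goes to this slot:
        printf("packet processed on CPU %d -> fanout array slot %d\n",
               cpu, fanout_demux_cpu_model(cpu, num_sockets));
    }
}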
Once you see that the whole design of the fanout group is based on computing some sort of hash from the properties of the received packet, and not from the properties of a thread trying to read a socket, you should probably just let the scheduler sort things out rather than pinning threads.
Alternatively, you could dig further into the code to reverse engineer the order of the sockets in the array, but that would be fragile, and you might want to verify that you have done so correctly using systemtap. You could then create the sockets in a deterministic order (hopefully resulting in a deterministic order in the array) and pin the thread listening on a given socket to the appropriate cpu.
Upvotes: 5