Reputation:
I want to compare the performance of Unix domain sockets between two processes with that of another IPC mechanism.
I have a basic program that creates a socket pair and then calls fork(). It then measures the round-trip time (RTT) to send an 8192-byte buffer to the other process and back (the buffer contents are distinct for each iteration).
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
int main(int argc, char **argv) {
    int i, pid, sockpair[2];
    char buf[8192];
    struct timespec tp1, tp2;

    assert(argc == 2);

    // Create a socket pair using Unix domain sockets with reliable,
    // in-order data transmission.
    assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sockpair) == 0);

    // We then fork to create a child process and then start the benchmark.
    pid = fork();
    if (pid == 0) { // This is the child process.
        for (i = 0; i < atoi(argv[1]); i++) {
            assert(recv(sockpair[1], buf, sizeof(buf), 0) > 0);
            assert(send(sockpair[1], buf, sizeof(buf), 0) > 0);
        }
    } else { // This is the parent process.
        for (i = 0; i < atoi(argv[1]); i++) {
            memset(buf, i, sizeof(buf));
            buf[sizeof(buf) - 1] = '\0';
            assert(clock_gettime(CLOCK_REALTIME, &tp1) == 0);
            assert(send(sockpair[0], buf, sizeof(buf), 0) > 0);
            assert(recv(sockpair[0], buf, sizeof(buf), 0) > 0);
            assert(clock_gettime(CLOCK_REALTIME, &tp2) == 0);
            // Full nanosecond difference, so the result stays correct
            // across a second boundary.
            printf("%ld ns\n", (long)((tp2.tv_sec - tp1.tv_sec) * 1000000000L
                                      + (tp2.tv_nsec - tp1.tv_nsec)));
        }
    }
    return 0;
}
However, I noticed that every time I run the test, the elapsed time for the first iteration (i = 0) is always an outlier:
79306 ns
18649 ns
19910 ns
19601 ns
...
I wonder whether the kernel has to do some extra setup on the first call to send(), for example allocating 8192 bytes in the kernel to buffer the data between the calls to send() and recv()?
Upvotes: 11
Views: 2304
Reputation: 3424
It's not the data copy that takes the extra 80 microseconds; that would be extremely slow (only about 100 MB/s). It's the fact that you're using two processes: when the parent sends the data for the first time, that data has to wait for the child to finish forking and start executing.
If you absolutely want to use two processes, you should first perform a send in the other direction, so that the parent waits for the child to be ready before it starts sending.
E.g.:
Child:
send();
recv();
send();
Parent:
recv();
gettime();
send();
recv();
gettime();
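A minimal sketch of that handshake, adapted from the program in the question (the single "R" ready byte is a detail chosen here; any payload works):

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int i, n, pid, sockpair[2];
    char buf[8192], ready;
    struct timespec tp1, tp2;

    assert(argc == 2);
    n = atoi(argv[1]);
    assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sockpair) == 0);

    pid = fork();
    if (pid == 0) {                                    // child
        assert(send(sockpair[1], "R", 1, 0) == 1);     // "I'm running"
        for (i = 0; i < n; i++) {
            assert(recv(sockpair[1], buf, sizeof(buf), 0) > 0);
            assert(send(sockpair[1], buf, sizeof(buf), 0) > 0);
        }
    } else {                                           // parent
        assert(recv(sockpair[0], &ready, 1, 0) == 1);  // wait for the child
        for (i = 0; i < n; i++) {
            memset(buf, i, sizeof(buf));
            assert(clock_gettime(CLOCK_REALTIME, &tp1) == 0);
            assert(send(sockpair[0], buf, sizeof(buf), 0) > 0);
            assert(recv(sockpair[0], buf, sizeof(buf), 0) > 0);
            assert(clock_gettime(CLOCK_REALTIME, &tp2) == 0);
            printf("%ld ns\n", (long)((tp2.tv_sec - tp1.tv_sec) * 1000000000L
                                      + (tp2.tv_nsec - tp1.tv_nsec)));
        }
    }
    return 0;
}

This way the parent's first timed send() no longer includes the time the child needs to get scheduled.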
Also, be aware that your test depends a lot on where the two processes are placed on the CPU cores: if they run on the same core, every round trip causes a task switch.
For this reason I would strongly recommend that you perform the measurement using a single process. Even without poll() or anything similar, you can do it this way, provided you keep the blocks reasonably small so that they fit into the socket buffers:
gettime();
send();
recv();
gettime();
You should first perform a non-measured round trip to ensure the buffers are allocated. I'm pretty sure you'll get much smaller times then.
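For instance, a sketch of that single-process version (the fixed iteration count and the placement of the untimed warm-up pass are choices made here):

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>

int main(void) {
    int i, sockpair[2];
    char buf[8192];
    struct timespec tp1, tp2;

    assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sockpair) == 0);
    memset(buf, 0, sizeof(buf));

    // Non-measured warm-up pass: allocates the socket buffers and warms
    // the caches, as suggested above.
    assert(send(sockpair[0], buf, sizeof(buf), 0) > 0);
    assert(recv(sockpair[1], buf, sizeof(buf), 0) > 0);

    for (i = 0; i < 10; i++) {
        memset(buf, i, sizeof(buf));
        assert(clock_gettime(CLOCK_REALTIME, &tp1) == 0);
        assert(send(sockpair[0], buf, sizeof(buf), 0) > 0);   // 8 KB fits the
        assert(recv(sockpair[1], buf, sizeof(buf), 0) > 0);   // socket buffer
        assert(clock_gettime(CLOCK_REALTIME, &tp2) == 0);
        printf("%ld ns\n", (long)((tp2.tv_sec - tp1.tv_sec) * 1000000000L
                                  + (tp2.tv_nsec - tp1.tv_nsec)));
    }
    return 0;
}

Because the block fits in the socket buffer, send() returns immediately and recv() on the other end of the pair finds the data already queued, so no second process is needed.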
Upvotes: 2
Reputation: 5590
In the Linux kernel you can find the ___sys_sendmsg function that gets used by send(). Check here to view the code.
The function has to copy the user message (in your case the 8 KB buf) from user space to kernel space. After that, recv() can copy the received message back from kernel space into the user space of the child process.
That means you need two memcpy operations and one kmalloc for each send()/recv() pair.
The first pair is special because the space in which to store the user message has not been allocated yet, which also means it is not yet present in the data cache. So the first send()/recv() pair allocates the kernel memory that will hold buf, and that memory also ends up in the caches. The following calls simply reuse that memory, via the used_address argument in the function's prototype.
So your assumption is correct: the first run allocates the 8 KB in the kernel and works with cold caches, while the later runs just reuse previously allocated and already-cached data.
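If that is what happens, a single untimed round trip should absorb the cost. A sketch of that idea applied to the parent loop from the question (the <= bound and the i > 0 guard are additions made here; the child's loop bound has to be raised the same way so it answers the extra round trip):

    for (i = 0; i <= atoi(argv[1]); i++) {          // one extra, untimed pass
        memset(buf, i, sizeof(buf));
        assert(clock_gettime(CLOCK_REALTIME, &tp1) == 0);
        assert(send(sockpair[0], buf, sizeof(buf), 0) > 0);
        assert(recv(sockpair[0], buf, sizeof(buf), 0) > 0);
        assert(clock_gettime(CLOCK_REALTIME, &tp2) == 0);
        if (i > 0)                                  // drop the cold first sample
            printf("%ld ns\n", (long)((tp2.tv_sec - tp1.tv_sec) * 1000000000L
                                      + (tp2.tv_nsec - tp1.tv_nsec)));
    }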
Upvotes: 1
Reputation: 365792
I'd guess that instruction-cache misses for the kernel code involved are a big part of the slowdown the first time through. Probably also data-cache misses for the kernel data structures keeping track of everything.
Lazy setup is a possibility, though.
You could test this by doing a sleep(10) between trials (including before the first trial) and by doing something that uses all of the CPU cache, like refreshing a web page, between each trial. If it's lazy setup, then only the first call will be extra slow. If not, then all calls will be equally slow when the caches are cold.
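For instance, a sketch of that experiment as it could be added to the program in the question (the helper name and the 32 MB scratch size are choices made here; the size is a guess at "bigger than the last-level cache", so adjust it for your CPU):

static char scratch[32 * 1024 * 1024];

static void cool_down(void) {
    // Touch far more data than any cache level holds, so cached kernel data
    // (and, through the shared outer cache levels, much of its code) gets
    // evicted before the next trial.
    memset(scratch, 1, sizeof(scratch));
    sleep(10);
}

Calling cool_down() at the top of every parent-loop iteration, including before the first one, would distinguish the two cases: if only the first round trip stays slow, lazy setup is the likely culprit; if every round trip becomes slow, cold caches are.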
Upvotes: 1