Reputation: 1092
I am trying to measure the performance of some networking code, and I am getting very varied results. So far I have not been able to explain them; maybe others can help or point me in the right direction.
So I create a socket:
socket(AF_INET, SOCK_STREAM, 0);
int one = 1;
setsockopt(socket, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)));
current = fcntl(socket, F_GETFL);
fcntl(socket, F_SETFL, O_NONBLOCK | current);
My messages are always around 200 bytes. The code that sends a message is:
uint64_t start = nanotimestamp();
unsigned char * buf;
...
// build a message
// a few calls to snprintf()
// the buffer is preallocated
...
write(sock, buf, size);
uint64_t end = nanotimestamp();
uint64_t performance = end - start;
uint64_t nanotimestamp()
{
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    return (uint64_t)now.tv_sec * 1000000000ULL + now.tv_nsec;
}
The code is being run on 64-bit Red Hat 6 and compiled with GCC 4.4. The timings vary a lot, from 20 to 80 microseconds, and in rare cases exceed 100 µs.
So if the call to write() is non-blocking, why do I see such a difference?
Upvotes: 2
Views: 2624
Reputation: 413
You should consider using CLOCK_MONOTONIC for this measurement - it has significantly lower overhead than getting CLOCK_REALTIME.
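A minimal sketch of what that could look like, reusing the nanotimestamp() idea from your question (same headers as your code, i.e. <stdint.h> and <time.h>):

uint64_t nanotimestamp_monotonic()
{
    struct timespec now;
    // CLOCK_MONOTONIC is not affected by adjustments to the system clock
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (uint64_t)now.tv_sec * 1000000000ULL + now.tv_nsec;
}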
For my performance measurements (we need nanosecond precision) I use the RDTSC counter.
With GCC 4.4+ (not 100% sure on this one, 4.6.1 implements this for sure) on an Intel system you could use:
#include <x86intrin.h>
uint64_t clock_count = __rdtsc();
or if not:
extern "C" {
    __inline__ uint64_t rdtsc()
    {
        uint32_t lo, hi;
        // cpuid acts as a serializing instruction so that rdtsc is not
        // reordered with earlier instructions
        __asm__ __volatile__ (
            "xorl %%eax,%%eax \n cpuid"
            ::: "%rax", "%rbx", "%rcx", "%rdx");
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
    }
}
Then dividing the delta of clock counts by your CPU frequency in hertz will give you a very precise measurement at a fraction of the cost of clock_gettime().
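For illustration, a sketch of that conversion; the 3 GHz figure below is a made-up placeholder, substitute the actual TSC frequency of your machine:

// Placeholder TSC frequency in Hz - substitute your CPU's actual value.
static const double TSC_HZ = 3.0e9;

// Convert a delta of rdtsc() readings into nanoseconds.
static double tsc_delta_to_ns(uint64_t start, uint64_t end)
{
    return (double)(end - start) / TSC_HZ * 1e9;
}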
Edit:
And to answer the actual question now :)
In your code you are actually measuring two things: building the message and sending it. You can either measure them separately or move the message building outside of the timed block. Writing data is expensive when you are measuring in microseconds.
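A sketch of the separate measurement (build_message() is a hypothetical stand-in for your snprintf() calls; sock, buf and nanotimestamp() as in your question):

uint64_t t0 = nanotimestamp();
size_t size = build_message(buf);   // hypothetical: the snprintf() calls
uint64_t t1 = nanotimestamp();
write(sock, buf, size);
uint64_t t2 = nanotimestamp();

uint64_t build_ns = t1 - t0;        // time spent formatting the message
uint64_t write_ns = t2 - t1;        // time spent in write() itself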
I would think that the problem is a combination of snprintf() and cache misses. Format functions have very poor performance, and since you are rebuilding the data every time, there is a chance of a cache miss every once in a while, which would explain the variability.
Upvotes: 2
Reputation: 6447
Interruptions by other threads and hardware or software interrupts have already been mentioned.
There's another thing to consider though. Depending on various factors, your non-blocking write() call might take very different code paths. E.g. it might have to allocate additional buffers, which takes time, or it might not have to. Or it might decide that the data should be sent right away, and go "straight to the metal" (call into the driver to deliver the data to the network interface for transmission).
Allocating buffers takes time, delivering the data to the network interface even more so.
So write() might be really fast (buffering the data in an existing buffer), a little bit slower (allocating an additional buffer) or "really" slow (going straight to the metal).
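If you want to correlate the slow calls with the state of the send buffer, one Linux-specific thing you could try (just a sketch, with sock being the non-blocking socket from the question) is to query how many unsent bytes are already queued before each write():

#include <sys/ioctl.h>
#include <linux/sockios.h>   // SIOCOUTQ

// Number of bytes in the send queue that have not yet been sent out,
// or -1 on error (Linux-specific).
int unsent_bytes(int sock)
{
    int pending = 0;
    if (ioctl(sock, SIOCOUTQ, &pending) == -1)
        return -1;
    return pending;
}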
Upvotes: 1
Reputation: 27572
It would be interesting to separate out the write() calls that fail with EAGAIN from those that actually transfer data. That might account for a lot of the discrepancy.
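A sketch of that separation (sock, buf, size and nanotimestamp() as in the question; the record_*() helpers are hypothetical bookkeeping functions):

#include <errno.h>

uint64_t t0 = nanotimestamp();
ssize_t n = write(sock, buf, size);
uint64_t t1 = nanotimestamp();

if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    record_eagain(t1 - t0);     // hypothetical: write() transferred nothing
} else {
    record_write(t1 - t0, n);   // hypothetical: n bytes actually accepted
}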
Also I wonder if TCP_NODELAY is really helping you in this scenario. It might be worth toggling that as well.
Upvotes: 0
Reputation: 10311
You can't measure just one write() call, since the process might get suspended while calling write() (which could account for the >100 µs outliers), and making a syscall introduces some variance of its own.
You need to call write() many more times and measure the time for all of those calls combined.
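A sketch of that approach (build_message() is a hypothetical stand-in for the message formatting, everything else as in the question):

const int N = 10000;                    // arbitrary number of iterations
uint64_t total_ns = 0;

for (int i = 0; i < N; ++i) {
    size_t size = build_message(buf);   // hypothetical message builder
    uint64_t t0 = nanotimestamp();
    write(sock, buf, size);
    uint64_t t1 = nanotimestamp();
    total_ns += t1 - t0;
}

// average cost of one write() call, smoothing out scheduler/interrupt noise
uint64_t avg_ns = total_ns / N;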
Upvotes: 0