Reputation: 1092
I am trying to measure the performance of some networking code, and I am getting very varied results. So far I have not been able to explain them; maybe others can help or point me in the right direction.
So I create a socket:
socket(AF_INET, SOCK_STREAM, 0);
int one = 1;
setsockopt(socket, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)));
current = fcntl(socket, F_GETFL);
fcntl(socket, F_SETFL, O_NONBLOCK | current);
My messages are always around 200 bytes. The code that sends a message is:
uint64_t start = nanotimestamp();
unsigned char * buf;
...
// build a message
// a few calls to snprintf()
// the buffer is preallocated
...
write(sock, buf, size);
uint64_t end = nanotimestamp();
uint64_t performance = end - start;
uint64_t nanotimestamp()
{
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    return (uint64_t)now.tv_sec * 1000000000ULL + now.tv_nsec;
}
The code is being run on 64-bit Red Hat 6 and compiled with GCC 4.4. The timings vary a lot, from 20 to 80 microseconds, and in rare cases exceed 100 µs.
So if the call to write() is non-blocking, why do I see such a difference?
Upvotes: 2
Views: 2624
Reputation: 413
You should consider using CLOCK_MONOTONIC for this measurement - it has significantly lower overhead than getting CLOCK_REALTIME.
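A minimal sketch of what that could look like, reusing the nanotimestamp() idea from your question (same headers as your code, i.e. <stdint.h> and <time.h>):

uint64_t nanotimestamp_monotonic()
{
    struct timespec now;
    // CLOCK_MONOTONIC is not affected by adjustments to the system clock
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (uint64_t)now.tv_sec * 1000000000ULL + now.tv_nsec;
}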
For my performance measurements (we need nanosecond precision) I use the RDTSC counter.
With GCC 4.4+ (not 100% sure on this one, 4.6.1 implements this for sure) on an Intel system you could use:
#include <x86intrin.h>
uint64_t clock_count = __rdtsc();
or if not:
extern "C" {
    __inline__ uint64_t rdtsc()
    {
        uint32_t lo, hi;
        // cpuid acts as a serializing instruction so that rdtsc is not
        // reordered with earlier instructions
        __asm__ __volatile__ (
            "xorl %%eax,%%eax \n cpuid"
            ::: "%rax", "%rbx", "%rcx", "%rdx");
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
    }
}
Then dividing the delta of clock counts by your CPU frequency in hertz will give you a very precise measurement at a fraction of the cost of clock_gettime().
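For illustration, a sketch of that conversion; the 3 GHz figure below is a made-up placeholder, substitute the actual TSC frequency of your machine:

// Placeholder TSC frequency in Hz - substitute your CPU's actual value.
static const double TSC_HZ = 3.0e9;

// Convert a delta of rdtsc() readings into nanoseconds.
static double tsc_delta_to_ns(uint64_t start, uint64_t end)
{
    return (double)(end - start) / TSC_HZ * 1e9;
}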
Edit:
And to answer the actual question now :)
In your code you are actually measuring two things: building the message and sending it. You can either measure them separately or move the message building outside of the timed block. Writing data is expensive when you are measuring in microseconds.
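A sketch of the separate measurement (build_message() is a hypothetical stand-in for your snprintf() calls; sock, buf and nanotimestamp() as in your question):

uint64_t t0 = nanotimestamp();
size_t size = build_message(buf);   // hypothetical: the snprintf() calls
uint64_t t1 = nanotimestamp();
write(sock, buf, size);
uint64_t t2 = nanotimestamp();

uint64_t build_ns = t1 - t0;        // time spent formatting the message
uint64_t write_ns = t2 - t1;        // time spent in write() itself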
I would think that the problem is a combination of snprintf() and cache misses. Format functions have very poor performance, and since you are rebuilding the data every time, there is a chance of a cache miss every once in a while, which would explain the variability.
Upvotes: 2
Reputation: 6447
Interruptions by other threads and hardware or software interrupts have already been mentioned.
There's another thing to consider though. Depending on various factors, your non-blocking write() call might take very different code paths. E.g. it might have to allocate additional buffers, which takes time, or it might not have to. Or it might decide that the data should be sent right away, and go "straight to the metal" (call into the driver to deliver the data to the network interface for transmission).
Allocating buffers takes time, delivering the data to the network interface even more so.
So write() might be really fast (buffering the data in an existing buffer), a little bit slower (allocating an additional buffer) or "really" slow (going straight to the metal).
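If you want to correlate the slow calls with the state of the send buffer, one Linux-specific thing you could try (just a sketch, with sock being the non-blocking socket from the question) is to query how many unsent bytes are already queued before each write():

#include <sys/ioctl.h>
#include <linux/sockios.h>   // SIOCOUTQ

// Number of bytes in the send queue that have not yet been sent out,
// or -1 on error (Linux-specific).
int unsent_bytes(int sock)
{
    int pending = 0;
    if (ioctl(sock, SIOCOUTQ, &pending) == -1)
        return -1;
    return pending;
}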
Upvotes: 1
Reputation: 27572
It would be interesting to separate out the write() calls that fail with EAGAIN from those that actually transfer data. That might account for a lot of the discrepancy.
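A sketch of that separation (sock, buf, size and nanotimestamp() as in the question; the record_*() helpers are hypothetical bookkeeping functions):

#include <errno.h>

uint64_t t0 = nanotimestamp();
ssize_t n = write(sock, buf, size);
uint64_t t1 = nanotimestamp();

if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    record_eagain(t1 - t0);     // hypothetical: write() transferred nothing
} else {
    record_write(t1 - t0, n);   // hypothetical: n bytes actually accepted
}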
Also I wonder if TCP_NODELAY is really helping you in this scenario. It might be worth toggling that as well.
Upvotes: 0
Reputation: 10311
You can't measure just one write() call, since the process might get suspended while calling write() (which could account for the >100 µs outliers), and making a syscall introduces some variance of its own.
You need to call write() many more times and measure the time for all of those calls combined.
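A sketch of that approach (build_message() is a hypothetical stand-in for the message formatting, everything else as in the question):

const int N = 10000;                    // arbitrary number of iterations
uint64_t total_ns = 0;

for (int i = 0; i < N; ++i) {
    size_t size = build_message(buf);   // hypothetical message builder
    uint64_t t0 = nanotimestamp();
    write(sock, buf, size);
    uint64_t t1 = nanotimestamp();
    total_ns += t1 - t0;
}

// average cost of one write() call, smoothing out scheduler/interrupt noise
uint64_t avg_ns = total_ns / N;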
Upvotes: 0