bsmartins

Reputation: 278

Why does the measured network latency change if I use a sleep?

I'm trying to determine the time that it takes for a machine to receive a packet, process it and give back an answer.

This machine, which I'll call the 'server', runs a very simple program: it receives a packet (recv(2)) into a buffer, copies the received content (memcpy(3)) to another buffer and sends the packet back (send(2)). The server runs NetBSD 5.1.2.
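
In outline, the server loop is something like this (a sketch only; conn is the accepted TCP socket, PKT_SIZE an assumed buffer size, error handling omitted):

char recv_buf[PKT_SIZE], send_buf[PKT_SIZE];
for (;;) {
    ssize_t n = recv(conn, recv_buf, sizeof(recv_buf), 0);
    if (n <= 0)
        break;                              /* peer closed or error */
    memcpy(send_buf, recv_buf, (size_t)n);  /* copy into the second buffer */
    send(conn, send_buf, (size_t)n, 0);     /* echo it back */
}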

My client measures the round-trip time a number of times (pkt_count):

struct timespec start, end;
for(i = 0; i < pkt_count; ++i)
{
    printf("%d ", i+1);

    clock_gettime(CLOCK_MONOTONIC, &start);        
    send(sock, send_buf, pkt_size, 0);
    recv(sock, recv_buf, pkt_size, 0);
    clock_gettime(CLOCK_MONOTONIC, &end);        

    //struct timespec nsleep = {.tv_sec = 0, .tv_nsec = 100000};
    //nanosleep(&nsleep, NULL);

    printf("%.3f ", timespec_diff_usec(&end, &start));
}   
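
Here timespec_diff_usec is a trivial helper (omitted above); assuming it returns the difference end - start in microseconds as a double, to match the "%.3f" format, it is along these lines:

/* Sketch of the helper: elapsed time in microseconds between two timespecs */
static double timespec_diff_usec(const struct timespec *end,
                                 const struct timespec *start)
{
    return (end->tv_sec  - start->tv_sec)  * 1e6 +
           (end->tv_nsec - start->tv_nsec) / 1e3;
}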

I removed error checks and other minor things for clarity. The client runs on Ubuntu 12.04 64-bit. Both programs run with real-time priority, although only the Ubuntu kernel is real-time (-rt). The connection between the programs is TCP. This works fine and gives me an average of 750 microseconds.
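
("Real-time priority" on the Linux side means something along these lines; this is only a sketch, the priority value 80 is an arbitrary example, and the call needs root or CAP_SYS_NICE:)

struct sched_param param = { .sched_priority = 80 };  /* illustrative value */
if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)   /* 0 = this process */
    perror("sched_setscheduler");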

However, if I enable the commented-out nanosleep call (with a sleep of 100 µs), my measurements drop by 100 µs, giving an average of 650 µs. If I sleep for 200 µs, the measurements drop to 550 µs, and so on. This goes on up to a sleep of 600 µs, which gives an average of 150 µs. Then, if I raise the sleep to 700 µs, my measurements jump way up, to an average of 800 µs. I confirmed my program's measurements with Wireshark.

I can't figure out what is happening. I have already set the TCP_NODELAY socket option on both client and server, with no difference. I used UDP; no difference (same behavior). So I guess this behavior is not due to the Nagle algorithm. What could it be?
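
(For reference, TCP_NODELAY was set roughly like this; sock is the connected socket and <netinet/tcp.h> provides the option constant:)

int flag = 1;
if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) < 0)
    perror("setsockopt(TCP_NODELAY)");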

[UPDATE]

Here's a screenshot of the output of the client together with Wireshark. Now, I ran my server on another machine. I used the same OS with the same configuration (as it is a Live System on a pen drive), but the hardware is different. This behaviour didn't show up; everything worked as expected. But the question remains: why does it happen on the previous hardware?

Output Comparison

[UPDATE 2: More info]

As I said before, I tested my pair of programs (client/server) on two different server computers. I plotted the two sets of results.

Comparison between two servers

The first server (the weird one) is a RTD Single Board Computer, with a 1Gbps Ethernet interface. The second server (the normal one) is a Diamond Single Board Computer with a 100Mbps Ethernet interface. Both of them run the SAME OS (NetBSD 5.1.2) from the SAME pendrive.

From these results, I do believe that this behaviour is due either to the driver or to the NIC itself, although I still can't imagine why it happens...

Upvotes: 11

Views: 1475

Answers (5)

egur

Reputation: 7970

Here is some advice on how to make a more accurate performance measurement: use the RDTSC instruction (or, even better, the __rdtsc() intrinsic). This reads a CPU counter without leaving ring 3 (no system call). The gettime functions almost always involve a system call, which slows things down.
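
A minimal sketch of the idea (assuming GCC/Clang on x86, where __rdtsc() comes from <x86intrin.h>; the result is in CPU cycles, which you convert to time using the TSC frequency):

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    struct timespec nsleep = { .tv_sec = 0, .tv_nsec = 100000 };

    uint64_t t0 = __rdtsc();        /* cycle counter before */
    nanosleep(&nsleep, NULL);       /* stand-in for the send()/recv() pair */
    uint64_t t1 = __rdtsc();        /* cycle counter after */

    printf("elapsed: %llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}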

Your code is a little tricky, as it involves two system calls (send/recv), but in general it is better to call sleep(0) before the first measurement to ensure that the very short measurement doesn't suffer a context switch. Of course, the time-measurement (and sleep()) code should be enabled/disabled via macros in performance-sensitive functions.
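
For example, something like this (MEASURE is a hypothetical macro name, enabled e.g. with -DMEASURE):

#ifdef MEASURE
#  define TIME_NOW(ts)  clock_gettime(CLOCK_MONOTONIC, &(ts))
#else
#  define TIME_NOW(ts)  ((void)0)   /* compiled out in performance builds */
#endif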

Some operating systems can be tricked into raising your process's priority by having your process release its execution time window (e.g. sleep(0)). On the next scheduling tick, the OS (not all of them) will raise the priority of your process, since it didn't finish running its execution time quota.

Upvotes: 0

bsmartins

Reputation: 278

OK, I reached a conclusion.

I tried my program using Linux, instead of NetBSD, on the server. It ran as expected, i.e. no matter how long I [nano]sleep at that point of the code, the result is the same.

This fact tells me that the problem might lie in NetBSD's interface driver. To identify the driver, I read the dmesg output. This is the relevant part:

wm0 at pci0 dev 25 function 0: 82801I mobile (AMT) LAN Controller, rev. 3
wm0: interrupting at ioapic0 pin 20
wm0: PCI-Express bus
wm0: FLASH
wm0: Ethernet address [OMITTED]
ukphy0 at wm0 phy 2: Generic IEEE 802.3u media interface
ukphy0: OUI 0x000ac2, model 0x000b, rev. 1
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto

So, as you can see, my interface is called wm0. According to this (page 9) I should check which driver is loaded by consulting the file sys/dev/pci/files.pci, line 625 (here). It shows:

# Intel i8254x Gigabit Ethernet
device  wm: ether, ifnet, arp, mii, mii_bitbang
attach  wm at pci
file    dev/pci/if_wm.c         wm

Then, searching through the driver source code (dev/pci/if_wm.c, here), I found a snippet of code that might change the driver behavior:

/*
 * For N interrupts/sec, set this value to:
 * 1000000000 / (N * 256).  Note that we set the
 * absolute and packet timer values to this value
 * divided by 4 to get "simple timer" behavior.
 */

sc->sc_itr = 1500;              /* 2604 ints/sec */
CSR_WRITE(sc, WMREG_ITR, sc->sc_itr);
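
(By that formula, the default of 1500 corresponds to 10^9 / (1500 × 256) ≈ 2604 interrupts/sec, which matches the comment, while a value of 1 would allow roughly 3.9 million interrupts/sec.)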

Then I changed this 1500 value to 1 (trying to increase the number of interrupts per second allowed) and to 0 (trying to eliminate the interrupt throttling altogether), but both of these values produced the same result:

  • Without nanosleep: latency of ~400 µs
  • With a nanosleep of 100 µs: latency of ~230 µs
  • With a nanosleep of 200 µs: latency of ~120 µs
  • With a nanosleep of 260 µs: latency of ~70 µs
  • With a nanosleep of 270 µs: latency of ~60 µs (minimum latency I could achieve)
  • With a nanosleep of anything above 300 µs: ~420 µs

This is at least better behaved than the previous situation.

Therefore, I concluded that the behavior is due to the interface driver of the server. I am not willing to investigate it further in order to find other culprits, as I am moving from NetBSD to Linux for the project involving this Single Board Computer.

Upvotes: 2

ash

Reputation: 5165

If the application is sending a large enough amount of data fast enough, it could be filling the kernel buffers, which leads to a delay on each send(). Since the sleep is outside the measured section, it would then be eating the time that would otherwise be spent blocking on the send() call.

One way to help check for this case would be to run with a relatively small number of iterations, and then with a moderate number of iterations. If the problem occurs with a small number of iterations (say 20) and small packet sizes (say <1k), then this is likely an incorrect diagnosis.

Keep in mind that your process and the kernel can easily overwhelm the network adapter and the wire-speed of the ethernet (or other media type) if sending data in a tight loop like this.

I'm having trouble reading the screenshots. If Wireshark shows a constant rate of transmission on the wire, then that suggests this is the correct diagnosis. Of course, doing the math - dividing the wire speed by the packet size (+ headers) - should give an idea of the maximum rate at which the packets can be sent.
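
(For instance, on the 100 Mbps link, a hypothetical 1000-byte payload plus roughly 50 bytes of headers works out to about 100×10^6 / (8 × 1050) ≈ 12,000 packets per second, i.e. one packet every ~84 µs at line rate.)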

As for the 700 microseconds leading to increased delay, that's harder to determine. I don't have any thoughts on that one.

Upvotes: 0

Znik

Reputation: 1146

I think the 'quantum' is the best theory to explain this. On Linux it is related to the context-switch frequency. The kernel gives each process a quantum of time, but the process is preempted in three situations:

  1. The process makes a system call
  2. The quantum runs out
  3. A hardware interrupt arrives (from the network, HDD, USB, clock, etc.)

Unused quantum time is assigned to another ready-to-run process, according to priorities, real-time settings, etc.

Actually, the context-switch frequency is configured at 10000 times per second, which gives about 100 µs per quantum. But context switching itself takes time and is CPU dependent; see this: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html. I don't understand why the context-switch frequency is that high, but that is a discussion for a Linux kernel forum.
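
One way to see how often the measuring process is actually being switched out is to read its context-switch counters with getrusage() (a sketch; on Linux, ru_nvcsw and ru_nivcsw hold the voluntary and involuntary counts):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;

    /* ... run the measurement loop here ... */

    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("voluntary: %ld, involuntary: %ld context switches\n",
               ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}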

You can find a partially similar problem here: https://serverfault.com/questions/14199/how-many-context-switches-is-normal-as-a-function-of-cpu-cores-or-other

Upvotes: 0

Jimbo

Reputation: 4515

This is a (hopefully educated) guess, but I think it might explain what you're seeing.

I'm not sure how real-time the Linux kernel is. It might not be fully pre-emptive... So, with that disclaimer, continuing :)...

Depending on the scheduler, a task will possibly have what is called a "quantum", which is just the amount of time it can run for before another task of the same priority is scheduled in. If the kernel is not fully pre-emptive, this might also be the point where a higher-priority task can run. This depends on the details of the scheduler, which I don't know enough about.

Anywhere between your first gettime and second gettime your task may be pre-empted. This just means that it is "paused" and another task gets to use the CPU for a certain amount of time.

The loop without the sleep might go something like this

clock_gettime(CLOCK_MONOTONIC, &start);        
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);  

printf("%.3f ", timespec_diff_usec(&end, &start));

clock_gettime(CLOCK_MONOTONIC, &start);        

<----- PREEMPTION .. your task's quantum has run out and the scheduler kicks in
       ... another task runs for a little while     
<----- PREEMPTION again and you're back on the CPU

send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);  

// Because you got pre-empted, your time measurement is artificially long
printf("%.3f ", timespec_diff_usec(&end, &start));

clock_gettime(CLOCK_MONOTONIC, &start);        

<----- PREEMPTION .. your task's quantum has run out and the scheduler kicks in
       ... another task runs for a little while     
<----- PREEMPTION again and you're back on the CPU

and so on....

When you put the nanosleep in, this is most likely a point where the scheduler is able to run before the current task's quantum expires (the same would apply to recv(), which blocks, too). So perhaps what you get is something like this:

clock_gettime(CLOCK_MONOTONIC, &start);        
send(sock, send_buf, pkt_size, 0);
recv(sock, recv_buf, pkt_size, 0);
clock_gettime(CLOCK_MONOTONIC, &end);  

struct timespec nsleep = {.tv_sec = 0, .tv_nsec = 100000};
nanosleep(&nsleep, NULL);

<----- PREEMPTION .. nanosleep allows the scheduler to kick in because this is a pre-emption point
       ... another task runs for a little while     
<----- PREEMPTION again and you're back on the CPU

// Now it so happens that because your task got preempted where it did, the time
// measurement has not been artificially increased. Your task can then finish the rest of
// its quantum
printf("%.3f ", timespec_diff_usec(&end, &start));

clock_gettime(CLOCK_MONOTONIC, &start);        
... and so on

Some kind of interleaving will then occur where sometimes you are preempted between the two gettime() calls and sometimes outside of them, because of the nanosleep. Depending on the sleep length, you might hit a sweet spot where you happen (by chance) to get your pre-emption point, on average, outside your time-measurement block.

Anyway, that's my two-pennies worth, hope it helps explain things :)

A little note on "nanoseconds" to finish with...

I think one needs to be cautious with "nanosecond" sleeps. The reason I say this is that I think it is unlikely that an average computer can actually achieve that resolution unless it uses special hardware.

Normally an OS will have a regular system "tick", generated perhaps every 5 ms. This is an interrupt generated by, say, an RTC (Real-Time Clock - just a bit of hardware). Using this "tick" the system then generates its internal time representation. Thus, the average OS will only have a time resolution of a few milliseconds. The reason this tick is not faster is that there is a balance to be struck between keeping very accurate time and not swamping the system with timer interrupts.

Not sure if I'm a little out of date with the average modern PC... I think some of them do have higher-resolution timers, but still not into the nanosecond range, and they might even struggle at 100 µs.

So, in summary, keep in mind that the best time resolution you're likely to get is normally in the milliseconds range.
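
One way to see what resolution the system itself advertises is clock_getres() (a sketch; note this reports the clock's granularity, not how precisely nanosleep() actually wakes up):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec res;

    if (clock_getres(CLOCK_MONOTONIC, &res) == 0)
        printf("CLOCK_MONOTONIC resolution: %ld ns\n",
               (long)(res.tv_sec * 1000000000L + res.tv_nsec));
    return 0;
}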

EDIT: Just revisiting this and thought I'd add the following... it doesn't explain what you're seeing, but it might provide another avenue for investigation...

As mentioned, the timing accuracy of the nanosleep is unlikely to be better than milliseconds. Also, your task can be pre-empted, which will also cause timing problems. There is also the problem that the time taken for a packet to go up the protocol stack can vary, as can the network delay.

One thing you could try, if your NIC supports it, is IEEE 1588 (aka PTP). If your NIC supports it, it can timestamp PTP event packets as they leave and enter the PHY. This will give you the best possible estimate of the network delay, and it eliminates any problems you might have with software pre-emption, etc. I know squat about Linux PTP I'm afraid, but you could try http://linuxptp.sourceforge.net/

Upvotes: 0
