Reputation: 2357
I have a small Linux kernel module that is a prototype for a device driver for hardware that doesn't exist yet. The code needs to do a short bit of computation as fast as possible from beginning to end, with a total duration of a few microseconds. I am trying to measure whether this is possible, using the Intel rdtscp instruction for timestamps and an ndelay() call to simulate the computation. I find that 99.9% of the time it runs as expected, but 0.1% of the time it shows a very large delay, as if something else were preempting the code despite it running inside a spinlock that should be disabling interrupts. This is run using a stock Ubuntu 64-bit kernel (4.4.0-112) with no extra realtime or low-latency patches.
Here is some example code that replicates this behavior. It is written as a handler for a /proc filesystem entry for easy testing, but I have only shown the function that actually computes the delays:
#define ITERATIONS 50000
#define SKIPITER 10

DEFINE_SPINLOCK(timer_lock);

static int timing_test_show(struct seq_file *m, void *v)
{
    uint64_t i;
    uint64_t first, start, stop, delta, max = 0, min = 1000000;
    uint64_t avg_ticks;
    uint32_t a, d, c;
    unsigned long flags;
    int above30k = 0;

    /* Baseline TSC reading taken once, before the measurement loop. */
    __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
    first = a | (((uint64_t)d) << 32);

    for (i = 0; i < ITERATIONS; i++) {
        spin_lock_irqsave(&timer_lock, flags);

        __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
        start = a | (((uint64_t)d) << 32);

        ndelay(1000); /* stand-in for the real computation */

        __asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
        stop = a | (((uint64_t)d) << 32);

        spin_unlock_irqrestore(&timer_lock, flags);

        /* Skip the first few iterations to let things warm up. */
        if (i < SKIPITER) continue;

        delta = stop - start;
        if (delta < min) min = delta;
        if (delta > max) max = delta;
        if (delta > 30000) above30k++;
    }

    seq_printf(m, "min: %llu max: %llu above30k: %d\n", min, max, above30k);
    avg_ticks = (stop - first) / ITERATIONS;
    seq_printf(m, "Average total ticks/iteration: %llu\n", avg_ticks);
    return 0;
}
Then if I run:
# cat /proc/timing_test
min: 4176 max: 58248 above30k: 56
Average total ticks/iteration: 4365
This is on a 3.4 GHz Sandy Bridge generation Core i7. The ~4200 ticks of the TSC (4200 / 3.4 GHz ≈ 1.2 µs) is about right for a little over 1 microsecond of delay. But about 0.1% of the time I see delays roughly 10x longer than expected, and in some cases I have seen times as long as 120,000 ticks.
These delays seem too long to be a single cache miss, even to DRAM. So I think it has to be either several cache misses or another task preempting the CPU in the middle of my critical section. I would like to understand the possible causes, to see whether they are something we can eliminate or whether we have to move to a custom processor/FPGA solution.
Things I have tried:

- rdmsr on MSR_SMI_COUNT. I tried adding that before and after, and there are no SMM interrupts happening while my code is executing.
- ndelay() takes into account variable clock speed, but I think the CPU clock only varies by a factor of 2, so this should not cause a >10x change.

Upvotes: 13
Views: 1023
Reputation: 52
FYI, in my system:
timingtest % uname -a
Linux xxxxxx 4.15.0-42-generic #45-Ubuntu SMP Thu Nov 15 19:32:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Replicating your example (with ndelay(1000);) I get:
timingtest % sudo cat /proc/timing_test
min: 3783 max: 66883 above30k: 20
Average total ticks/iteration: 4005
timingtest % sudo cat /proc/timing_test
min: 3783 max: 64282 above30k: 19
Average total ticks/iteration: 4010
Replicating your example (with udelay(1);) I get:
timingtest % sudo cat /proc/timing_test
min: 3308 max: 43301 above30k: 2
Average total ticks/iteration: 3611
timingtest % sudo cat /proc/timing_test
min: 3303 max: 44244 above30k: 2
Average total ticks/iteration: 3600
ndelay(), udelay(), and mdelay() are for use in atomic context, as stated here: https://www.kernel.org/doc/Documentation/timers/timers-howto.txt. They all rely on the __const_udelay() function, a vmlinux exported symbol (which uses the LFENCE/RDTSC instructions).
Anyway, I replaced the delay with:
for (delta = 0, c = 0; delta < 500; delta++) { c++; c |= (c << 24); c &= ~(c << 16); }
for a trivial busy loop, with the same results.
I also tried with _cli()/_sti(), local_bh_disable()/local_bh_enable(), and preempt_disable()/preempt_enable(), without success.
Examining SMM interrupts (before and after the delay) with:
/* MSR 0x34 is MSR_SMI_COUNT */
__asm__ volatile ("rdmsr" : "=a" (a), "=d" (d) : "c" (0x34));
smi_after = a | (((uint64_t)d) << 32);
I always obtain the same number (no SMI, or the register is not updated).
Executing the cat command under trace-cmd to explore what's happening, I get results that are surprisingly not so scattered in time. (!?)
timingtest % sudo trace-cmd record -o trace.dat -p function_graph cat /proc/timing_test
plugin 'function_graph'
min: 3559 max: 4161 above30k: 0
Average total ticks/iteration: 5863
...
In my system, the problem can be solved by making use of Power Management Quality of Service (PM QoS); see https://access.redhat.com/articles/65410 and the sketch below. Hope this helps.
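That article describes the /dev/cpu_dma_latency interface. As a minimal sketch of one way to hold such a request from userspace (my example, not code from the article; writing a 32-bit latency target of 0 keeps the CPUs out of deep C-states for as long as the file descriptor stays open):

/*
 * Sketch: hold a PM QoS cpu_dma_latency request of 0 us. The kernel
 * honors the request only while the file descriptor remains open.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int32_t target_us = 0; /* maximum tolerated wakeup latency, in us */
    int fd = open("/dev/cpu_dma_latency", O_WRONLY);

    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }
    if (write(fd, &target_us, sizeof(target_us)) != sizeof(target_us)) {
        perror("write");
        close(fd);
        return 1;
    }
    pause(); /* keep the fd open while low latency is required */
    close(fd);
    return 0;
}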
Upvotes: 0
Reputation: 2214
Another thing I have just noticed: it is unclear what your ndelay() does. Maybe you should show it, since non-trivial problems may be lurking inside it.
For example, I once observed that a piece of my kernel driver code was still preempted when it had a memory leak inside it; as soon as it hit some watermark limit, it was put aside even though it disabled interrupts.
Upvotes: 1
Reputation: 2214
The 120,000 ticks that you observed in extreme cases sounds a lot like an SMM handler. Lesser values might have been caused by an assortment of microarchitectural events (by the way, have you checked all the performance counters available to you?), but a delay of that size must be caused by a subroutine written by someone who was not writing their code to achieve minimal latency.
However, you stated that you've checked that no SMIs are observed. This leads me to think that either something is wrong with the kernel facilities to count/report them, or with your method of looking for them. Hunting for SMIs without a hardware debugger may be a frustrating endeavor; one independent cross-check is sketched below.
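As such a cross-check, one could read MSR_SMI_COUNT through the msr driver from userspace, independently of the in-module rdmsr. This is a sketch under two assumptions: the msr module is loaded (modprobe msr), and the counter lives at MSR 0x34, as on recent Intel CPUs:

/*
 * Sketch: read MSR_SMI_COUNT (0x34) for CPU 0 via /dev/cpu/0/msr.
 * The msr device interprets the file offset as the MSR number.
 * Needs root and "modprobe msr".
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_SMI_COUNT 0x34

int main(void)
{
    uint64_t smi_count;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);

    if (fd < 0) {
        perror("open /dev/cpu/0/msr (is the msr module loaded?)");
        return 1;
    }
    if (pread(fd, &smi_count, sizeof(smi_count), MSR_SMI_COUNT) != sizeof(smi_count)) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("SMI count: %llu\n", (unsigned long long)smi_count);
    close(fd);
    return 0;
}

Comparing this value before and after a run of the test gives a second, independent view of the in-kernel count.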
Upvotes: 0