Reputation: 50238
I know that I have to use: rdtsc. The measured function is deterministic but the result is far from being repeatable (I get 5% oscillations from run to run). Possible causes are:
Do you know any other causes? How to eliminate them?
Upvotes: 3
Views: 6408
Reputation: 21270
Actually there is the new perf subsystem in newer Linux kernels. Example:
$ ./perf stat du -s /tmp 94796 /tmp Performance counter stats for 'du -s /tmp': 2.546403 task-clock-msecs # 0.060 CPUs 3 context-switches # 0.001 M/sec 0 CPU-migrations # 0.000 M/sec 166 page-faults # 0.065 M/sec 2434963 cycles # 956.236 M/sec 1798092 instructions # 0.738 IPC 302969 branches # 118.979 M/sec 26197 branch-misses # 8.647 % 23217 cache-references # 9.118 M/sec 4621 cache-misses # 1.815 M/sec 0.042406580 seconds time elapsed
Upvotes: 2
Reputation: 2696
TSCs (what rdtsc uses) are often not synchronized on multi-processor systems. It may help to set the CPU affinity in order to bind the process to a single CPU.
You could also get timestamps from HPET timers if available, which aren't prone to the same problem.
As for repeatability, those variances are true. You could disable caching, give a realtime priority to the process and/or (if on Linux or something similar) recompile your kernel with a lower, fixed timer interrupt frequency (the one that does time-slicing). You can't eliminate the variance completely, at least not easily and not on regular CPU + OS combos.
In general, for easy coding, reliability and portability reasons, I suggest you use what the OS has to offer. If it offers high-precision timers, use the appropriate OS helper.
(Just in case you're trying a time attack on a crypto system, well, you're going to have to live with 1. this randomness and 2. general defenses that make the system unpredictable for good reasons, so the function might not be deterministic with respect to time.)
EDIT: added paragraph about timers the OS can offer.
EDIT: This refers to Linux. For binding a process to a single CPU (to have an accurate read from RDTSC), you can use sched_setaffinity(2). And here's some code from one of my projects using it for some other purpose (mapping threads to CPUs). This should be your first try. As for HPETs, you can use regular POSIX calls like these, as long as the kernel and the machine support these timers.
Upvotes: 5
Reputation: 5870
Most modern processors support a remarkable set of low-level hardware performance counters. If you really want to know the answers, including real measurements on cache misses and context switching overhead, grab the PAPI (Performance API) toolkit, then, on some (though not all) OSes install one kernel patch, and, with some additional effort, you're off and running.
Upvotes: 0
Reputation: 941267
Why eliminate them? Sounds like you've created a realistic benchmark. That code would have the same kind of variability when used in the wild as well. Probably worse since you're likely to have eliminated disk and CPU cache latencies. Using Jon Skeet's approach, creating conditions that gives you the best possible result, will only leave you with a result that makes you feel good but is never achievable.
If an absolute number is important, calculate the median, not the average.
Upvotes: 3
Reputation: 20609
Adding to the list of causes: branch prediction/misprediction (this can be effected by a context switch with complex prediction caches on some chips. Also prediction can be affected by different inputs to your program and so direct comparison between the time for two different data sets can be slightly skewed.
In general, it is nigh impossible to mitigate all of these, but there are some things you can do to help each one:
But, of course, by far the best method of doing timings like this is to do them very many times on very large pieces of data such that the variability introduced by things you can't control is minimized. (It can never really be erased.)
Upvotes: 1
Reputation: 64026
See the question Is stopwatch benchmarking acceptable? for a discourse on the variance of micro-benchmarking on a modern multi-core multi-threading multi-process machine.
Although the question is about Java, the considerations apply to benchmarking in any language.
Also see: How do I write a correct micro-benchmark in Java?
Alse see: What advice can you give me for writing a meaningful benchmark?
Upvotes: 2