Reputation: 973
I'm not sure whether Stack Overflow is the best forum for this, but here goes...
We are benchmarking our software, and we suddenly saw a huge performance degradation. To see what was happening, we ran the software on the same input multiple times. The results were startling (huge performance variability):
Solution time = 9.69 sec.
Solution time = 7.55 sec.
Solution time = 4.78 sec.
Solution time = 5.12 sec.
Solution time = 6.94 sec.
Solution time = 2.15 sec.
Solution time = 5.48 sec.
Our software is multithreaded and starts exactly one thread on each core of a 12-core machine (hyperthreading is disabled). Nothing else is running on the machine. Before the weekend we had never seen such variation in runtime.
On a whim, we repeated the test with CPU binding enabled (i.e., pinning each of the 12 threads to a different core):
Solution time = 0.95 sec.
Solution time = 0.95 sec.
Solution time = 0.95 sec.
Solution time = 0.95 sec.
Solution time = 0.94 sec.
Solution time = 0.95 sec.
Solution time = 0.95 sec.
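For reference, the binding was done along the lines of the sketch below. The question doesn't show our actual code, so the thread-creation structure, the worker signature, and the use of pthread_setaffinity_np are just assumptions about one common way to pin threads on Linux:

    #include <pthread.h>   // pthread_setaffinity_np (GNU extension)
    #include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
    #include <thread>
    #include <vector>

    // Sketch only: start one worker per core and pin each worker to its core.
    // The worker function and the thread setup are placeholders, not our code.
    void run_pinned(int num_cores, void (*worker)(int core))
    {
        std::vector<std::thread> threads;
        for (int core = 0; core < num_cores; ++core) {
            threads.emplace_back(worker, core);
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            // Bind the freshly created thread to exactly one core;
            // a real program should check the return value.
            pthread_setaffinity_np(threads.back().native_handle(),
                                   sizeof(cpu_set_t), &set);
        }
        for (std::thread &t : threads)
            t.join();
    }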
At this point I have no idea what could cause this. As far as we know, nothing has changed in the configuration of the machine (RHEL 6.6). I'd be grateful for any suggestions...
Thanks, --Laci
EDIT:
Just to emphasize again: in the past the non-bound code did exhibit variation, but it was on the order of 10-15% at most, and on average it was very close to the runs with CPU binding (within 1-2%). It is only since last weekend that we started to see this change, and as far as we are aware, nothing has changed in the environment. But (obviously) something must have changed, and I wonder what it could have been.
EDIT2:
I ran the code through perf (with 10 repetitions), and this is what I got.
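(Output in this format comes from something along the lines of the command below; the exact invocation and binary name are not shown here, so treat them as placeholders.)

    perf stat -r 10 ./solver   # "./solver" stands in for our binary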
With cpu-binding:
15713.138442 task-clock-msecs # 9.341 CPUs ( +- 0.037% )
6958 context-switches # 0.000 M/sec ( +- 0.357% )
11 CPU-migrations # 0.000 M/sec ( +- 1.786% )
49147 page-faults # 0.003 M/sec ( +- 0.514% )
45890046261 cycles # 2920.489 M/sec ( +- 0.030% )
51929307378 instructions # 1.132 IPC ( +- 0.021% )
11050565282 branches # 703.269 M/sec ( +- 0.032% )
446256370 branch-misses # 4.038 % ( +- 0.003% )
421789915 cache-references # 26.843 M/sec ( +- 0.048% )
18989944 cache-misses # 1.209 M/sec ( +- 0.305% )
1.682190890 seconds time elapsed ( +- 0.131% )
Without cpu-binding:
36219.945761 task-clock-msecs # 5.677 CPUs ( +- 3.978% )
8742 context-switches # 0.000 M/sec ( +- 1.677% )
34 CPU-migrations # 0.000 M/sec ( +- 5.243% )
48799 page-faults # 0.001 M/sec ( +- 0.839% )
106384797638 cycles # 2937.188 M/sec ( +- 3.989% )
93465235493 instructions # 0.879 IPC ( +- 3.085% )
23685574664 branches # 653.937 M/sec ( +- 3.672% )
477076300 branch-misses # 2.014 % ( +- 0.563% )
414008416 cache-references # 11.430 M/sec ( +- 0.189% )
17910783 cache-misses # 0.495 M/sec ( +- 1.468% )
6.380128466 seconds time elapsed ( +- 5.171% )
Note that the code is deterministic, i.e., it always takes the same execution path. It is possible, however, that a thread busy-waits for its turn to synchronize with a global deterministic state. But why would that cause such a huge difference in cycles, instructions, etc.?
Also note that I tried pinning the threads to the cores in a random order, to test the hypothesis that pinning them in the order of their creation is what makes the difference. That changed nothing; the runs were still fast.
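For clarity, this is the kind of busy-wait I mean (placeholder names; our actual synchronization scheme is not shown here). A thread spinning like this does retire instructions and consume cycles while it waits, although that alone does not explain why the waits would suddenly become so much longer:

    #include <atomic>

    // Sketch of a turn-based busy-wait; placeholder names, not our real code.
    std::atomic<int> current_turn{0};

    void wait_for_turn(int my_turn)
    {
        // Each iteration retires instructions and consumes cycles while the
        // thread makes no progress, which shows up in perf's counters.
        while (current_turn.load(std::memory_order_acquire) != my_turn) {
            // spin
        }
    }

    void finish_turn()
    {
        current_turn.fetch_add(1, std::memory_order_release);
    }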
Upvotes: 1
Views: 314
Reputation: 973
The problem is resolved. Following ekarak's suggestion, I drilled down with profiling and found that the time was being wasted where our threads interact and wait for each other. On a hunch I rebooted the machine, and then everything went back to normal.
Now, we have 40 machines in a cluster, and all of them exhibited this behavior. This means there was either an external influence or the uptime of the machines played a role. I googled "linux uptime 209", which turned up reports that the kernel these machines are running has an overflow bug in sched_clock() that triggers after 208.5 days of uptime.
So... we have issues related to thread interaction on machines that had been up for 209 days, and there is a bug in sched_clock() that manifests after 208.5 days. That is just too much of a coincidence for me, so I concluded this is our problem.
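As a sanity check on the 208.5-day figure: 2^54 nanoseconds comes out to almost exactly 208.5 days, so the number is at least consistent with a 64-bit, nanosecond-based timestamp calculation wrapping. This is my own back-of-the-envelope arithmetic, not a statement about the exact kernel code:

    #include <cstdio>

    int main()
    {
        // 2^54 nanoseconds expressed in days: prints ~208.5, matching the
        // uptime at which the sched_clock() overflow is reported to trigger.
        const double ns = static_cast<double>(1ULL << 54);
        std::printf("%.1f days\n", ns / 1e9 / 86400.0);
        return 0;
    }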
Upvotes: 2