LaszloLadanyi

Reputation: 973

What can cause huge variation in the runtime of multithreaded code?

I'm not sure whether Stack Overflow is the best forum for this, but here goes...

We are benchmarking our software and suddenly saw a huge performance degradation. To see what was happening, we ran the software on the same input multiple times. The results were startling, with huge variability in runtime:

Solution time = 9.69 sec.
Solution time = 7.55 sec.
Solution time = 4.78 sec.
Solution time = 5.12 sec.
Solution time = 6.94 sec.
Solution time = 2.15 sec.
Solution time = 5.48 sec.

Our software is multithreaded and starts exactly one thread on every core of a 12-core machine (hyperthreading is disabled). Nothing else is running on the machine. Before the weekend we had never seen such variation in runtime.

On a whim, we repeated the test with CPU binding enabled, i.e., pinning each of the 12 threads to a different core (roughly as in the sketch after these timings):

Solution time = 0.95 sec.
Solution time = 0.95 sec.
Solution time = 0.95 sec.
Solution time = 0.95 sec.
Solution time = 0.94 sec.
Solution time = 0.95 sec.
Solution time = 0.95 sec.
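
For completeness, the binding is done per-thread with a CPU affinity call; this is a simplified sketch of the kind of pinning I mean (not our actual code, the names are made up):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Simplified sketch: pin the calling thread to a single core.
       In our setup each of the 12 worker threads would call something
       like this with a distinct 0-based core index right after it starts. */
    static int pin_self_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

The unbound runs above are the same binary with this affinity step skipped.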

At this point I have no idea what could cause this. As far as we know, nothing has changed in the configuration of the machine (RHEL 6.6). I'd be grateful for any suggestions...

Thanks, --Laci

EDIT:

Just to emphasize again: in the past the non-bound code did exhibit variation, but it was on the order of 10-15% at most, and on average it was very close to the CPU-bound (pinned) runs, within 1-2%. It's only since last weekend that we started to see this change, and as far as we are aware nothing has changed in the environment. But (obviously) something must have changed, and I wonder what it could have been.

EDIT2:

I ran the code through perf (with 10 repetitions), and this is what I got.

With cpu-binding:

   15713.138442  task-clock-msecs         #      9.341 CPUs    ( +-   0.037% )
           6958  context-switches         #      0.000 M/sec   ( +-   0.357% )
             11  CPU-migrations           #      0.000 M/sec   ( +-   1.786% )
          49147  page-faults              #      0.003 M/sec   ( +-   0.514% )
    45890046261  cycles                   #   2920.489 M/sec   ( +-   0.030% )
    51929307378  instructions             #      1.132 IPC     ( +-   0.021% )
    11050565282  branches                 #    703.269 M/sec   ( +-   0.032% )
      446256370  branch-misses            #      4.038 %       ( +-   0.003% )
      421789915  cache-references         #     26.843 M/sec   ( +-   0.048% )
       18989944  cache-misses             #      1.209 M/sec   ( +-   0.305% )

    1.682190890  seconds time elapsed   ( +-   0.131% )

Without cpu-binding:

   36219.945761  task-clock-msecs         #      5.677 CPUs    ( +-   3.978% )
           8742  context-switches         #      0.000 M/sec   ( +-   1.677% )
             34  CPU-migrations           #      0.000 M/sec   ( +-   5.243% )
          48799  page-faults              #      0.001 M/sec   ( +-   0.839% )
   106384797638  cycles                   #   2937.188 M/sec   ( +-   3.989% )
    93465235493  instructions             #      0.879 IPC     ( +-   3.085% )
    23685574664  branches                 #    653.937 M/sec   ( +-   3.672% )
      477076300  branch-misses            #      2.014 %       ( +-   0.563% )
      414008416  cache-references         #     11.430 M/sec   ( +-   0.189% )
       17910783  cache-misses             #      0.495 M/sec   ( +-   1.468% )

    6.380128466  seconds time elapsed   ( +-   5.171% )

Note that the code is deterministic, i.e., it always takes the same execution path. It is possible, however, that a thread busy-waits for its turn to synchronize with a global deterministic state. But why would that cause such a huge difference in cycles, instructions, etc.?
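
To show concretely what I mean by busy-waiting (a simplified sketch with made-up names, not our real synchronization code): while a thread spins it keeps retiring instructions, so any time spent waiting is counted by perf just like real work.

    #include <stdatomic.h>

    /* Sketch of a spin-wait on a global, deterministically advancing
       "turn" counter (hypothetical names). The loop keeps executing for
       as long as the thread waits, so longer waits inflate the
       cycle/instruction/branch counts without doing useful work. */
    extern atomic_int current_turn;

    static void wait_for_turn(int my_turn)
    {
        while (atomic_load_explicit(&current_turn,
                                    memory_order_acquire) != my_turn) {
            /* spin */
        }
    }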

Also note that I tried pinning the threads to the cores in a random order, to test the hypothesis that pinning them in the order of their creation makes a difference. It made no difference; the runs were still fast.

Upvotes: 1

Views: 314

Answers (1)

LaszloLadanyi

Reputation: 973

The problem is resolved. Following ekarak's suggestion I drilled down with profiling and found that the time was being wasted where our threads interact and wait for each other. On a hunch I rebooted the machine, and everything went back to normal.

Now, we have 40 machines in a cluster, and all of them exhibited this behavior. This means there was either an external influence, or the uptime of the machines played a role. I googled "linux uptime 209", which turned up that the kernel version these machines are running has an overflow bug in sched_clock() that is triggered after 208.5 days of uptime.

So... we had thread-interaction issues on machines that had been up for 209 days, and there is a bug in sched_clock() that manifests after 208.5 days. That's just too much of a coincidence for me, so I concluded that this is our problem.
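
For anyone wondering where the 208.5-day figure comes from: as I understand the reports, the cycles-to-nanoseconds conversion in sched_clock() multiplies the TSC value by a scale factor and shifts the 64-bit product right by 10 bits, so the product overflows once the clock reaches roughly 2^54 ns. A quick back-of-the-envelope check (my own arithmetic, not taken from the kernel source):

    #include <stdio.h>

    /* Back-of-the-envelope check of the 208.5-day horizon, assuming the
       64-bit cycles*scale product overflows once the shifted result
       reaches 2^54 nanoseconds. */
    int main(void)
    {
        double overflow_ns = (double)(1ULL << 54);      /* 2^54 ns */
        double days = overflow_ns / 1e9 / 86400.0;      /* ns -> s -> days */
        printf("overflow horizon: %.1f days\n", days);  /* prints ~208.5 */
        return 0;
    }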

Upvotes: 2
