d0m1n1c
d0m1n1c

Reputation: 157

Why is my for loop of cilk_spawn doing better than my cilk_for loop?

I have

cilk_for (int i = 0; i < 100; i++)
   x = fib(35);

the above takes 6.151 seconds

and

for (int i = 0; i < 100; i++)
   x = cilk_spawn fib(35);

takes 5.703 seconds

The fib(x) is the horrible recursive Fibonacci number function. If I dial down the fib function cilk_for does better than cilk_spawn, but it seems to me that regardless of the time it takes to do fib(x) cilk_for should do better than cilk_spawn.

What don't I understand?

Upvotes: 1

Views: 927

Answers (1)

Arch D. Robison
Arch D. Robison

Reputation: 4049

Per comments, the issue was a missing cilk_sync. I'll expand on that to point out exactly how the ratio of time can be predicted with surprising accuracy.

On a system with P hardware threads (typically 8 on a i7) for/cilk_spawn code will execute as follows:

  1. The initial thread will execute the iteration for i=0, and leave a continuation that is stolen by some other thread.
  2. Each thief will steal an iteration and leave a continuation for the next iteration.
  3. When each thief finishes an iteration, it goes back to step 2, unless there are no more iterations to steal.

Thus the threads will execute the loop hand-over-hand, and the loop exits at a point where P-1 threads are still working on iterations. So the loop can be expected to finish after evaluating only (100-P-1) iterations.

So for 8 hardware threads, the for/cilk_spawn with missing cilk_sync should take about 93/100 of the time for the cilk_for, quite close to the observed ratio of about 5.703/6.151 = 0.927.

In contrast, in a "child steal" system such as TBB or PPL task_group, the loop will race to completion, generating 100 tasks, and then keep going until a call to task_group::wait. In that case, forgetting the synchronization would have led to a much more dramatic ratio of times.

Upvotes: 2

Related Questions