Multithreaded algorithms work much slower

I have tried with OpenMP and Cilk Plus. The result is the same, multithreading works slower. I don't know what I'm doing wrong. I did what the guy did in this tutorial

His code works better in parallel, while the situation in mine is like this:

PARALLEL: Fibonacci number #42 is 267914296
Calculated in 33.026 seconds using 8 workers

SERIAL: Fibonacci number #42 is 267914296
Calculated in 2.110 seconds using 8 workers

I exactly copied the source code of the tutorial.

I also tried it with OpenMP, the same thing happens there too. I check the usage of CPU cores during the execution. They all work, it is fine.

I tried to change the number of workers with this command:

export CILK_NWORKERS=4

It appears as the number of workers increases, the algorithm runs slower. But sometimes it doesn't. I implemented Cilk codes on both C and C++. No difference.

This is the sequential Fibonacci function:

int fib_s(int n)
{
    if (n < 2)
        return n;
    int x = fib_s(n-1);
    int y = fib_s(n-2);

    return x + y;
}

This is the parallel Fibonacci function:

int fib(int n)
{
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n-1);
    int y = fib(n-2);
    cilk_sync;
    return x + y;
}

And I calculate running time like this in main() function:

clock_t start = clock();
int result = fib(n);
clock_t end = clock();
double duration = (double)(end - start) / CLOCKS_PER_SEC;

Can anyone help me?

Upvotes: 1

Answers (2)

user3666197

Reputation: 1

Q : Can anyone help me?

Yes. You will see, that fib( 42 ) may take less than 25 [us] in a single-worker interpreted (!) code

Given the PARALLEL code above has been reported to spend ~33 [s] on processing, a compiled-code can compute a fib( ~ 1,700,000 ) during the same ~33 [s], if designed right.

Solution :

Any recursively formulated problem description is an Ol' Mathematicians' sin:

While it may look pretty cool on paper,
it scales ugly on stack and blocks awfully lot of resources for any deeper recursion...
making all "previous"-levels wait most of the time,
until both return 2 and return 1 have happened in all their descendant paths
and the accumulation-phase of the recursively formulated algorithm begins to grow surfacing back to the top from all the depth of the deep-recursion dive.

This dependency-tree equals to a pure-[SERIAL] ( one-after-another ) progress of computing, and any attempt to inject { [CONCURENT] | [PARALLEL] }-processing orchestration will but increase the costs of processing ( adding all the add-on overheads ) without any improvement of the pure-[SERIAL] sequence of dependency-driven accumulation of the result.

Let's have a look, how AWFULLY bad the `cilk_spawn fib( N )` went :

f(42)
   |
   x=--> --> --> --> --> --> --> --> --> --> --> -- --> --> --> --> --> --> --> --> --> --> --> --> --> -->f(41)
   |                                                                                                          |
   y=f(40)                                                                                                    x=--> --> --> --> --> --> --> --> --> -->  f(40)
   ~    |                                                                                                     |                                             |
   ~    x=--> --> --> --> --> --> --> --> --> f(39)                                                           y=f(39)                                       x=--> --> --> --> --> --> --> --> -->  f(39)
   ~    |                                        |                                                            ~    |                                        |                                         |   
   ~    y=f(38)                                  x=--> --> --> --> --> --> f(38)                              ~    x=--> --> --> --> f(38)                  y=f(38)                                   x=--> --> --> --> --> --> f(38)
   ~    ~    |                                   |                            |                               ~    |                    |                   ~    |                                    |                            |
   ~    ~    x=--> --> f(37)                     y=f(37)                      x=--> --> f(37)                 ~    y=f(37)              x=--> --> --> f(37) ~    x=--> --> f(37)                      y=f(37)                      x=--> --> f(37)
   ~    ~    |            |                      ~    |                       |            |                  ~    ~    |               |                |  ~    |            |                       ~    |                       |            |
   ~    ~    y=f(36)      x=--> --> f(36)        ~    x=--> --> f(36)         y=f(36)      x=-->f(36)         ~    ~    x=--> --> f(36) y=f(36)          x= ~    y=f(36)      x=--> --> f(36)         ~    x=--> --> f(36)         y=f(36)      x=--> --> f(36)
   ~    ~    ~    |       |            |         ~    |            |          ~    |       |       |          ~    ~    |            |  ~    |           |  ~    ~    |       |            |          ~    |            |          ~    |       |            |
   ~    ~    ~    x=-->f  y=f(35)      x=-->f    ~    y=f(35)      x=-->f(35) ~    x=-->f  y=f(35) x=-->f     ~    ~    y=f(35)      x= ~    x=-->f(35)  y= ~    ~    x=-->f  y=f(35)      x=-->f(35) ~    y=f(35)      x=-->f(35) ~    x=-->   y=f(35)      x=-->f(35)
   ~    ~    ~    |       ~    |       |         ~    ~    |       |       |  ~    |       ~    |  |          ~    ~    ~    |       |  ~    |       |   ~  ~    ~    |       ~    |       |       |  ~    ~    |       |       |  ~    |       ~    |       |       |
   ~    ~    ~    y=f(34) ~    x=-->f  y=f(34)   ~    ~    x=-->f  y=f(34) x= ~    y=f(34) ~    x= y=f(34)    ~    ~    ~    x=-->f  y= ~    y=f(34) x=  ~  ~    ~    y=f(34) ~    x=-->f  y=f(34) x= ~    ~    x=-->f  y=f(34) x= ~    y=f(34) ~    x=-->f  y=f(34) x=-->f
   ~    ~    ~    ~    |  ~    |       ~    |    ~    ~    |       ~    |  |  ~    ~    |  ~    |  ~    |     ~    ~    ~    |       ~  ~    ~    |  |   ~  ~    ~    ~    |  ~    |       ~    |  |  ~    ~    |       ~    |  |  ~    ~    |  ~    |       ~       |
   ~    ~    ~    ~    x= ~    y=f(33) ~    x=   ~    ~    y=f(33) ~    x= y= ~    ~    x= ~    y= ~    x=    ~    ~    ~    y=f(33) ~  ~    ~    x= y=  ~  ~    ~    ~    x= ~    y=f(33) ~    x= y= ~    ~    y=f(33) ~    x= y= ~    ~    x= ~    y=f(33) ~       y=f(33)
   ~    ~    ~    ~    |  ~    ~    |  ~    |    ~    ~    ~    |  ~    |  ~  ~    ~    |  ~    ~  ~    |     ~    ~    ~    ~    |  ~  ~    ~    |  ~   ~  ~    ~    ~    |  ~    ~    |  ~    |  ~  ~    ~    ~    |  ~    |  ~  ~    ~    |  ~    ~    |  ~       ~    |
   ~    ~    ~    ~    y= ~    ~    x= ~    y=   ~    ~    ~    x= ~    y= ~  ~    ~    y= ~    ~  ~    y=    ~    ~    ~    ~    x= ~  ~    ~    y= ~   ~  ~    ~    ~    y= ~    ~    x= ~    y= ~  ~    ~    ~    x= ~    y= ~  ~    ~    y= ~    ~    x= ~       ~    x=-->f
   ~    ~    ~    ~    ~  ~    ~    |  ~    ~    ~    ~    ~    |  ~    ~  ~  ~    ~    ~  ~    ~  ~    ~     ~    ~    ~    ~    |  ~  ~    ~    ~  ~   ~  ~    ~    ~    ~  ~    ~    |  ~    ~  ~  ~    ~    ~    |  ~    ~  ~  ~    ~    ~  ~    ~    |  ~       ~    |
   :    :    :    :    :
   :    :    :    :     
   :    :    :
   ~    ~  --SYNC-----------f(36)+f(37)
   ~    ~ <--RET x+y // <-- f(38)
   ~  --SYNC----------------f(38)+f(39)
   ~ <--RET x+y      // <-- f(40)
 --SYNC---------------------f(40)+f(41)
<--RET x+y           // <-- f(42)

Just count, how many times the top-down running recursive-method of Fib( N ) has been fully re-counted for each and respective value of N - yes, you count one and the same thing that many times again and again and again, just due to the "mathematical"-lazines of the recursive method:

fib( N == 42 ) was during recursion calculated .........1x times...
fib( N == 41 ) was during recursion calculated .........1x times...
fib( N == 40 ) was during recursion calculated .........2x times...
fib( N == 39 ) was during recursion calculated .........3x times...
fib( N == 38 ) was during recursion calculated .........5x times...
fib( N == 37 ) was during recursion calculated .........8x times...
fib( N == 36 ) was during recursion calculated ........13x times...
fib( N == 35 ) was during recursion calculated ........21x times...
fib( N == 34 ) was during recursion calculated ........34x times...
fib( N == 33 ) was during recursion calculated ........55x times...
fib( N == 32 ) was during recursion calculated ........89x times...
fib( N == 31 ) was during recursion calculated .......144x times...
fib( N == 30 ) was during recursion calculated .......233x times...
fib( N == 29 ) was during recursion calculated .......377x times...
fib( N == 28 ) was during recursion calculated .......610x times...
fib( N == 27 ) was during recursion calculated .......987x times...
fib( N == 26 ) was during recursion calculated ......1597x times...
fib( N == 25 ) was during recursion calculated ......2584x times...
fib( N == 24 ) was during recursion calculated ......4181x times...
fib( N == 23 ) was during recursion calculated ......6765x times...
fib( N == 22 ) was during recursion calculated .....10946x times...
fib( N == 21 ) was during recursion calculated .....17711x times...
fib( N == 20 ) was during recursion calculated .....28657x times...
fib( N == 19 ) was during recursion calculated .....46368x times...
fib( N == 18 ) was during recursion calculated .....75025x times...
fib( N == 17 ) was during recursion calculated ....121393x times...
fib( N == 16 ) was during recursion calculated ....196418x times...
fib( N == 15 ) was during recursion calculated ....317811x times...
fib( N == 14 ) was during recursion calculated ....514229x times...
fib( N == 13 ) was during recursion calculated ....832040x times...
fib( N == 12 ) was during recursion calculated ...1346269x times...
fib( N == 11 ) was during recursion calculated ...2178309x times...
fib( N == 10 ) was during recursion calculated ...3524578x times...
fib( N ==  9 ) was during recursion calculated ...5702887x times...
fib( N ==  8 ) was during recursion calculated ...9227465x times...
fib( N ==  7 ) was during recursion calculated ..14930352x times...
fib( N ==  6 ) was during recursion calculated ..24157817x times...
fib( N ==  5 ) was during recursion calculated ..39088169x times...
fib( N ==  4 ) was during recursion calculated ..63245986x times...
fib( N ==  3 ) was during recursion calculated .102334155x times...
fib( N ==  2 ) was during recursion calculated .165580141x times...
fib( N ==  1 ) was during recursion calculated .102334155x times...

A FAST & RESOURCES' EFFICIENT PROCESSING - AN INSPIRATION :

While the original, recursive computation called 535,828,591 times (!!!) the same trivial fib() (very often a one, that has been somewhere else already calculated
---- some even hundreds of millions many times already ~ 102,334,155x times... as fib( 3 ) ), spawning as many as 267,914,295 just-[CONCURRENT] code-execution blocks, enqueued for but 8-workers, all waiting most of the time but for having dived their spawned children as deep as to reach return 1 and return 2 to later do nothing else but to add a pair of then returned numbers and returning from the expensively spawned own process, a "direct"-method of processing is out of question way smarter and way faster:

int fib_direct( int n ) // PSEUDO-CODE
{   assert(  n > 0      && "EXCEPTION: fib_direct() was called with a wrong parameter value" );
    if (  n == 1
       || n == 2
          ) return n;
 // ---------------------------- .ALLOC + .SET 
    int fib_[ max(4,n) ];
        fib_[3] = 3;
        fib_[4] = 5;
 // ---------------------------- .LOOP LESS THAN N-TIMES
    for(           int i = 5; i <= n; i++ )
    {   fib_[i] = fib_[i-2]
                + fib_[i-1];
        }
 // ---------------------------- .RET
    return fib_[n];
    }

A bit more efficient implementation ( still just a single thread and still just interpreted ) managed to easily compute fib_direct( 230000 ) in less than 2.1 [s] which was your compiled code runtime for just a fib( 42 ).

Upvotes: 2

Leos313

Reputation: 5627

The right answer to your question is hardware-dependent. Many factors can influence the performance of code in general: different software strategy can be applied to accelerate execution. However, some of them are more efficient of the other depending on (1) the particular application chosen and (2) on the particular hardware platform chosen. I would like to recommend to profile your application.

Here, you can find a general introduction to the software profiling while, here, a list of software tools that will help you in this task.

In this and this other links, you can find information for profiling OpenMP application (the case of your question).

It is always a good practice to know and understand what is happening under the hood. This will allow you to locate the bottleneck of the famous tris application/code/hardware.