gab

Reputation: 307

Why is OpenMP slower than sequential execution?

I have the following code:

    omp_set_num_threads(10);
    
    int N = 10000;
    int chunk = 1000;
    int i;
    float a[N];

    for (i = 0; i < N; i++)
    {
        a[i] = i;
    }

    #pragma omp parallel private(i)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i = 0; i < N; i++) a[i] = a[i] + a[i];
    }

The sequential code is the same, without the two pragma directives.

Sequential: ~150 µs. Parallel: ~1100 µs.

That's a huge gap, and I expected it to be the other way around.

Does someone have a clue what's wrong, or does OpenMP really have that much work to do in the background? And does someone have an example where I can see the parallelized for loop actually being faster?

Thanks

Upvotes: 0

Views: 764

Answers (1)

ShadowRanger

Reputation: 155323

Launching ten threads and managing them is more expensive than adding 10,000 elements. If you're on a system with fewer than 10 physical cores, the threads will compete for time slices and incur context switches and (potentially) extra cache misses. You also chose dynamic scheduling, which requires some synchronization so the threads can figure out which chunks they're going to take (not a lot, but enough to slow things down when the work being distributed is trivial by comparison).

In one informal measurement, launching a no-op thread and immediately detaching it cost about 10 μs, and those threads didn't need to do anything. In your case, that's roughly 100 μs just for launching the ten threads, before counting any of the other inefficiencies threading can introduce.

Parallelizing helps for big workloads, or when the workers are used many times for many moderate sized tasks (so the cost of launching the threads is a fraction of the work they do). But performing 10,000 additions is chump change to a CPU; you're just not doing enough to benefit from parallelizing it.

Upvotes: 2
