GreenEye

Reputation: 163

Multi-threaded speed-up only after making array private

I am trying to learn multi-threaded programming using openmp.

To begin with, I was testing a nested loop with a large number of array accesses and then parallelizing it; the code is attached below. Basically, I have this fairly large array tmp in the interior loop. If I make it shared so that every thread can access and change it, my code actually slows down as the number of threads increases, even though I have written it so that every thread writes exactly the same values to tmp. When I make tmp private instead, I get a speed-up proportional to the number of threads. The number of operations looks exactly the same to me in both cases. Why does it slow down when tmp is shared? Is it because different threads try to access the same address at the same time?

#include <stdio.h>
#include <time.h>
#include <omp.h>

int main(){
    int k,m,n,dummy_cntr=5000,nthread=10,id;
    long num=10000000;
    double x[num],tmp[dummy_cntr];
    double tm,fact;
    clock_t st,fn;

    st=clock();
    omp_set_num_threads(nthread);
#pragma omp parallel private(tmp)
    {
        id = omp_get_thread_num();
        printf("Thread no. %d \n",id);
#pragma omp for
        for (k=0; k<num; k++){
            x[k]=k+1;
            for (m=0; m<dummy_cntr; m++){
                tmp[m] = m;
            }
        }
    }
    fn=clock();
    tm=(double)(fn-st)/CLOCKS_PER_SEC;
}

P.S.: I am aware that using clock() here doesn't really give the wall-clock time; since it sums CPU time across all threads, I have to divide it by the number of threads to get output similar to "time ./a.out".

Upvotes: 1

Views: 339

Answers (2)

user2088790

Reputation:

Your code has race conditions on tmp and m. I don't know what you are really trying to do, but this link might be helpful: Fill histograms (array reduction) in parallel with OpenMP without using a critical section

I tried cleaning up your code. This code allocates memory for tmp separately in each thread, which solves your problem with false sharing on tmp.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    int k,m,dummy_cntr=5000;
    long num=10000000;
    double *x, *tmp;
    double dtime;

    x = (double*)malloc(sizeof(double)*num);

    dtime = omp_get_wtime();
    #pragma omp parallel private(tmp, k, m)
    {
        tmp = (double*)malloc(sizeof(double)*dummy_cntr);
        #pragma omp for
        for (k=0; k<num; k++){
            x[k]=k+1;
            for (m=0; m<dummy_cntr; m++){
                tmp[m] = m;
            }
        }
        free(tmp);
    }
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);
    free(x);
    return 0;
}

Compiled with

gcc -fopenmp -O3 -std=c89 -Wall -pedantic foo.c

Upvotes: 1

Pragmateek

Reputation: 13396

This may be due to cache contention: if part of the array is accessed by two or more threads, it gets cached multiple times, one copy per core. Whenever a core needs data that another core has just changed, it must fetch the latest version from that core's cache, and that transfer takes time. With tmp shared, every thread keeps rewriting the same cache lines, so the cores spend their time shuttling those lines back and forth instead of doing useful work.

Upvotes: 5
