Alex Alex

Reputation: 15

C++ OpenMP working really slowly on matrix-vector product

So, I'm computing a matrix-vector product using OpenMP, but I've noticed it's running really slowly. After spending some time trying to figure out what's wrong, I deleted all the code in the parallel section, and it's still SLOW. What could the problem be here? (n = 1000)

Here are the timing results for 1, 2 and 4 cores.

seq_method time = 0.001047194215062

parallel_method (1) time = 0.001050273191140 seq - par = -0.000003078976079 seq/par = 0.997068404578433

parallel_method (2) time = 0.001961992426004 seq - par = -0.000914798210943 seq/par = 0.533740192460558

parallel_method (4) time = 0.004448095121916 seq - par = -0.003400900906854 seq/par = 0.235425319459132

Even when I delete the code from the parallel section, it doesn't change much.

void parallel_method(float A[n][n], float B[n], float C[n], int thr_num)
{
    double t1, t2;
    float tmp = 0;
    int i, j;
    t1 = omp_get_wtime();


    omp_set_dynamic(0);
    omp_set_num_threads(thr_num);
#pragma omp parallel for private(tmp, j, i)
    for (i = 0; i < n; i++) {
        tmp = 0;
        for (j = 0; j < n; j++) {
            tmp += A[i][j] * B[j];
        }
#pragma omp atomic
        C[i] += tmp;
    }

    t2 = omp_get_wtime();
    if (show_c) print_vector(C);
    par = t2 - t1;
    printf("\nparrallel_method (%d) time = %.15f", thr_num, par);
    printf("\nseq - par = %.15f", seq - par);
    printf("\nseq/par = %.15f\n", seq / par);
}

Code: https://pastebin.com/Q20t5DLk

Upvotes: 0

Views: 315

Answers (1)

Alain Merigot

Reputation: 11537

I tried to reproduce your problem and was not able to; I observe completely coherent behavior.

n=100
sequential_method (0) time = 0.000023339001928
parallel_method (1) time = 0.000023508997401
parallel_method (2) time = 0.000013864002540
parallel_method (4) time = 0.000008979986887

n=1000
sequential_method (0) time = 0.001439775005565
parallel_method (1) time = 0.001437967992388
parallel_method (2) time = 0.000701391996699
parallel_method (4) time = 0.000372130998080

n=10000
sequential_method (0) time = 0.140988592000213
parallel_method (1) time = 0.133375317003811
parallel_method (2) time = 0.077803490007180
parallel_method (4) time = 0.044142695999355

Except for small sizes, where the thread overhead is significant, the results are more or less what is expected.

What I did:

  • all measurements are done in the same run

  • I run each function once without timing, to warm up the caches

For real-code estimations, I would also:

  • time several successive executions of the same function, especially if the time is short, in order to smooth out small variations

  • run several experiments and keep the smallest one to suppress outliers (I prefer the minimum, but you can also compute the average); a sketch of such a harness follows this list
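
A minimal sketch of such a harness (the name time_kernel and the function-pointer interface are illustrative assumptions, not part of the question's code):

#include <float.h>
#include <omp.h>

/* Time a kernel: one untimed warm-up run to fill the caches,
   then several timed runs, keeping the minimum to suppress outliers. */
double time_kernel(void (*kernel)(void), int repeats)
{
    kernel();                               /* warm-up, not timed */
    double best = DBL_MAX;
    for (int r = 0; r < repeats; r++) {
        double t1 = omp_get_wtime();
        kernel();
        double t2 = omp_get_wtime();
        if (t2 - t1 < best)
            best = t2 - t1;                 /* keep the fastest run */
    }
    return best;
}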

You should have posted all your code, and I do not know what your methodology is. But I think that your estimations were done in separate runs and without warming up the caches. For this code, the cache impact is very important, and every core has to load the same data (B). And the problem is not large enough to benefit from the larger combined L1/L2 caches. These multiple loads may explain the worse performance of the parallel code.

One last remark on your code: every thread gets its own values of i, hence each C[i] can be accessed by only one thread and the atomic pragma is useless.
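
Concretely, the loop can be written without the atomic; here is a sketch using the question's variable names, with tmp and the loop indices declared inside the loop so they are private by construction:

#pragma omp parallel for
    for (int i = 0; i < n; i++) {
        float tmp = 0;                      /* local to the iteration, hence to the thread */
        for (int j = 0; j < n; j++)
            tmp += A[i][j] * B[j];
        C[i] += tmp;                        /* each i is handled by exactly one thread */
    }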

Upvotes: 1
