Reputation: 11
The original code looks like this:
for (int i = 0; i < 20; i++) {
    if (/* condition elided */) {
        do();   /* placeholder */
    } else {
        num2 = _mm_set_pd(Phasor.imaginary, Phasor.real);
        /* SamplesIneachPeriodCeil[iterationIndex] is in the range of 175000 */
        for (int k = 0; k < SamplesIneachPeriodCeil[iterationIndex]; k++) {
            num1 = _mm_loaddup_pd(&OutSymbol[k].real);
            num3 = _mm_mul_pd(num2, num1);
            num1 = _mm_loaddup_pd(&OutSymbol[k].imaginary);
            num2 = _mm_shuffle_pd(num2, num2, 1);
            num4 = _mm_mul_pd(num2, num1);
            num3 = _mm_addsub_pd(num3, num4);
            num2 = _mm_shuffle_pd(num2, num2, 1);
            num5 = _mm_set_pd(InSymbolInt8[k], InSymbolInt8[k]);
            num6 = _mm_mul_pd(num3, num5);
            num7 = _mm_set_pd(Out[k].imaginary, Out[k].real);
            num8 = _mm_add_pd(num7, num6);
            _mm_storeu_pd((double *)&Out[k], num8);
        }
        Out = Out + SamplesIneachPeriodCeil[iterationIndex];
    }
}
This code runs in around 15 ms.

When I modified the code to include OpenMP (note: I am showing only the else portion here):
else {
    int size = SamplesIneachPeriodCeil[iterationIndex];
    #pragma omp parallel num_threads(2) shared(size)
    {
        int start, end, tindex, tno;
        tindex = omp_get_thread_num();
        tno = omp_get_num_threads();
        start = tindex * size / tno;
        end = (1 + tindex) * size / tno;
        num2 = _mm_set_pd(Phasor.imaginary, Phasor.real);
        for (int k = start; k < end; k++) {
            num1 = _mm_loaddup_pd(&OutSymbol[k].real);
            num3 = _mm_mul_pd(num2, num1);
            num1 = _mm_loaddup_pd(&OutSymbol[k].imaginary);
            num2 = _mm_shuffle_pd(num2, num2, 1);
            num4 = _mm_mul_pd(num2, num1);
            num3 = _mm_addsub_pd(num3, num4);
            //_mm_storeu_pd((double *)&newSymbol, num3);
            num2 = _mm_shuffle_pd(num2, num2, 1);
            num5 = _mm_set_pd(InSymbolInt8[k], InSymbolInt8[k]);
            num6 = _mm_mul_pd(num3, num5);
            num7 = _mm_set_pd(Out[k].imaginary, Out[k].real);
            num8 = _mm_add_pd(num7, num6);
            _mm_storeu_pd((double *)&Out[k], num8);
        }
    }
    Out = Out + size;
}
This version runs at around 30 ms, so I was wondering if I have done anything wrong here.
Upvotes: 1
Views: 1814
Reputation: 45434
You should start your parallel region outside of the outer loop (over i) and parallelize the for loop over k using omp for. All variables used inside the loops (num1, num2, ...) are best declared only within them so that they are automatically private (actually, most of them could be reused, but the compiler should find that out anyway).
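A minimal sketch of that structure, assuming the variables from the question (with real and imaginary stored as adjacent doubles) and that the elided if-condition evaluates identically in every thread; the names phasor, phasor_sw, prod, and gain are mine:

/* sketch only; requires <pmmintrin.h> (SSE3) and <omp.h> */
#pragma omp parallel num_threads(2)
for (int i = 0; i < 20; i++) {
    if (/* condition elided in the question */) {
        #pragma omp single
        { /* scalar branch, executed by one thread only */ }
    } else {
        /* Declared inside the parallel region, hence private per thread;
           loop-invariant, so hoisted out of the k loop. */
        const __m128d phasor    = _mm_set_pd(Phasor.imaginary, Phasor.real);
        const __m128d phasor_sw = _mm_shuffle_pd(phasor, phasor, 1);
        #pragma omp for
        for (int k = 0; k < SamplesIneachPeriodCeil[iterationIndex]; k++) {
            /* Temporaries declared inside the loop are automatically private. */
            __m128d re   = _mm_loaddup_pd(&OutSymbol[k].real);
            __m128d im   = _mm_loaddup_pd(&OutSymbol[k].imaginary);
            __m128d prod = _mm_addsub_pd(_mm_mul_pd(phasor, re),
                                         _mm_mul_pd(phasor_sw, im));
            __m128d gain = _mm_set1_pd(InSymbolInt8[k]);
            __m128d out  = _mm_set_pd(Out[k].imaginary, Out[k].real);
            _mm_storeu_pd((double *)&Out[k],
                          _mm_add_pd(out, _mm_mul_pd(prod, gain)));
        }
        #pragma omp single
        Out += SamplesIneachPeriodCeil[iterationIndex];
        /* the implicit barrier of 'single' keeps the shared Out
           pointer consistent before the next iteration */
    }
}

The twice-shuffled num2 of the original becomes the loop-invariant phasor_sw here, which removes the only cross-iteration dependency and makes the k loop safe to split.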
Upvotes: 0
Reputation: 74395
You are doing nothing to distribute the execution of the loop between the two threads. You are just creating a parallel region with two threads, and those threads execute exactly the same code. What you might want to do is move the parallel region so that it encompasses only the for loop, and use the work-sharing construct:
int k;
#pragma omp parallel for num_threads(2) ...
for (k = start; k < end; k++) {
    ...
}
Thanks to Tudor for the correction. Your code is correctly parallelised, but you have a parallel region inside a loop. Entering and exiting a parallel region incurs some overhead. This is usually described as the "fork/join model": a team of threads is created on entering the region, and all threads are joined back to the master on exiting. Most OpenMP runtimes use various thread-pooling techniques to reduce this overhead, but it is still there.
Your loop runs for about 15 milliseconds in total. That is already fast enough that the OpenMP overhead becomes visible. Think of moving the parallel region outside the outer loop: the overhead should then be reduced by a factor of up to 20 (depending on how often the else branch is taken), but you might still not see an improvement in the computation time.
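Schematically, the difference looks like this (a sketch only; size stands for the trip count of the inner loop):

/* current structure: up to 20 fork/join cycles, one per outer iteration */
for (int i = 0; i < 20; i++) {
    #pragma omp parallel num_threads(2)
    {
        /* split and run the k loop */
    }
}

/* suggested structure: a single fork/join for the whole computation;
   only the k loop is divided between the threads */
#pragma omp parallel num_threads(2)
for (int i = 0; i < 20; i++) {
    #pragma omp for
    for (int k = 0; k < size; k++) {
        /* ... */
    }
}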
Parallelisation is only applicable to problems that are large enough for the communication or synchronisation overhead to be negligible, or at least small, in comparison to the computation time.
Upvotes: 2