Mixing OpenMP and xmmintrin SSE Intrinsics - not getting speedup over the non-parallel version

Question

I've implemented a version of the Travelling Salesman with xmmintrin.h SSE instructions, received a decent speedup. But now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slow down. I'm getting the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE && OpenMP).

I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.

The main loop of my program has the following (brief) pseudocode:

int currentNode; 

for(int i = 0; i < numNodes; i++) {
    minimumDistance = DBL_MAX;
    minimumDistanceNode;

    for(int j = 0; j < numNodes; j++) {
        // find distance between 'currentNode' to j-th node
        // ...
        if(jthNodeDistance < minimumDistance) {
            minimumDistance = jthNodeDistance;
            minimumDistanceNode = jthNode;
        }
    }
    currentNode = minimumDistanceNode;
}

And here is my implementation, that is still semi-pseudocode as I've still brushed over some parts that I don't think have an impact on performance, I think the issues to be found with my code can be found in the following code snippet. If you just omit the #pragma lines, then the following is pretty much identical to the SSE only version of the same program, so I figure I should only include the OpenMP version:

    int currentNode = 0;

    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int i = 1; i < totalNum; i++) {
            miniumum = DBL_MAX;

            __m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
            __m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);

            #pragma omp parallel num_threads(omp_get_max_threads())
            {
                float localMinimum = DBL_MAX;
                float localMinimumNode;

                #pragma omp for 
                for (int j = 0; j < loopEnd; j += 4) {
                    // a number of SSE vector calculations to find distance
                    // between the current node and the four nodes we're looking
                    // at in this iteration of the loop:
                    __m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
                    __m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
                    __m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
                    __m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
                    __m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);

                    float temp[unroll];
                    _mm_store_ps(&temp[0], addXY_0);

                    // skipping stuff here that is about getting the minimum distance and
                    // it's equivalent node, don't think it's massively relevant but
                    // each thread will have its own
                    //  localMinimum
                    //  localMinimumNode
                }


                // updating the global minimumNode in a thread-safe way
                #pragma omp critical (update_minimum)
                {
                    if (localMinimum < minimum) {
                        minimum = localMinimum;
                        minimumNode = localMinimumNode;
                    }
                }
            }

            // within the 'omp single'
            ThisPt = minimumNode;
        }
        }
    }

So my logic is:

omp single for the top-level for(int i) for loop, and I only want 1 thread dedicated to this
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) for-loop, as I want all cores working on this part of the code at the same time.
omp critical at the end of the full for(int j) loop, as I want to thread-safely update the current node.

In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.

Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?

Mixing OpenMP and xmmintrin SSE Intrinsics - not getting speedup over the non-parallel version

Answers (1)

Related Questions