Me- La Ría

Reputation: 51

pragma omp for of atomic operations on a histogram

I'm having trouble efficiently parallelizing the following loop:

#   pragma omp for nowait
    for (int i = 0; i < M; i++) { 
#       pragma omp atomic
        centroids[points[i].cluster].points_in_cluster++;
    }

This runs slower (I guess due to the omp for overhead) than this:

#   pragma omp single nowait
    for (int i = 0; i < M; i++) { 
        centroids[points[i].cluster].points_in_cluster++;
    }

Is there any way to make this go faster?

Upvotes: 1

Views: 162

Answers (1)

paleonix

Reputation: 3049

Theory

While atomics are certainly better than locks or critical regions, since most platforms implement them in hardware, they should still generally be avoided where possible: they do not scale well, i.e. increasing the number of threads creates more atomic collisions and therefore more overhead. Further hardware- and implementation-specific bottlenecks caused by atomics are described in the comments below the question and in this answer by @PeterCordes.

The alternative to atomics is a parallel reduction algorithm. Assuming that there are many more points than centroids, one can use OpenMP's reduction clause to give every thread a private version of centroids. These private histograms are then consolidated in an implementation-defined fashion after they have been filled.

There is no guarantee that this technique is faster than atomics in every case: the outcome depends not only on the sizes of the two index spaces, but also on the data itself, as the data determines the number of collisions when using atomics. A proper parallel reduction algorithm is, however, generally expected to scale better to large numbers of threads.

Practice

The problem with using a reduction in your code is the Array-of-Structs (AoS) data layout. Specifying

# pragma omp for reduction(+: centroids[0:num_centroids])

will produce an error at build time, as the compiler does not know how to reduce the user-defined type of centroids. Specifying

# pragma omp for reduction(+: centroids[0:num_centroids].points_in_cluster)

does not work either as it is not a valid OpenMP array section.

One can try to use a custom reduction here, but I do not know how to combine a user-defined reduction with OpenMP array sections (see the edit at the end). It could also be quite inefficient, as every thread would get private copies of all the unused members of the centroid struct.

With a Struct-of-Arrays (SoA) data layout you would just have a plain integer buffer, e.g. int *points_in_clusters, which could then be used in the following way (assuming that centroids, and now points_in_clusters as well, has num_centroids elements):

#   pragma omp for nowait reduction(+: points_in_clusters[0:num_centroids])
    for (int i = 0; i < M; i++) { 
        points_in_clusters[points[i].cluster]++;
    }

If you cannot change the data layout, you could still use some scratch space for the OpenMP reduction and afterwards copy the results back into the centroid structs in a second loop. However, this additional copy could eat into the savings from using the reduction in the first place.

Using SoA also has benefits for (auto-) vectorization (of other loops) and potentially improves cache locality for regular access patterns. AoS on the other hand can be better for cache locality when encountering random access patterns (e.g. most sorting algorithms if the comparison makes use of multiple variables from the struct).


PS: Be careful with nowait: does the work that follows really not depend on the resulting points_in_cluster values?

EDIT: I removed my alternative implementation using a user-defined reduction operator as it was not working. I seem to have fixed the problem, but I do not have enough confidence in this implementation (performance- and correctness-wise) to add it back into the answer. Feel free to improve upon the linked code and post another answer.

Upvotes: 3

Related Questions