Reputation: 1
I am working on a parallel sum scan algorithm and my results are incorrect.
I am working on an implementation of the Hillis-Steele scan in OpenMP, but my function is outputting incorrect results:
void prefixSumShared(double *arr, int size, int numThreads)
{
    double logsize = ceil(log2(size));
    for (int j = 1; j <= logsize; j++)
    {
        int power = 1 << (j - 1); // 2^(j-1)
        int power2 = 1 << j;      // 2^j
        #pragma omp parallel for num_threads(numThreads)
        for (int k = 1; k < size; k++)
        {
            if (k >= power2)
            {
                arr[k] = arr[k] + arr[k - power];
            }
        }
    }
}
My output changes with the number of threads I use.
When the input array is the numbers 0 through 9 and I run with 2 threads:
{0.000000,1.000000,2.000000,3.000000,4.000000,5.000000,6.000000,7.000000,8.000000,9.000000}
my output is
{0.000000,1.000000,3.000000,7.000000,13.000000,23.000000,37.000000,57.000000,83.000000,119.000000}
The expected prefix sums are {0, 1, 3, 6, 10, 15, 21, 28, 36, 45}.
Once I get this working with known values, I will change the input array to random values between 0 and 2.
I tried different approaches, like the one found in this YouTube video, and I tried running it with various thread counts, but I kept getting incorrect results.
Upvotes: 0
Views: 408
Reputation: 181179
I am working on an implementation of the Hillis-Steele scan in OpenMP.
OpenMP does not provide a model of parallel computation appropriate for the algorithms described in the Hillis & Steele paper you reference. From their introductory comments:
In this article we describe a series of algorithms appropriate for fine-grained parallel computers with general communications. We call these algorithms data parallel algorithms because their parallelism comes from simultaneous operations across large sets of data, rather than from multiple threads of control. The intent is not so much to present new algorithms (most have been described earlier in other contexts), but rather to demonstrate a style of programming that is appropriate for a machine with tens of thousands or even millions of processors. The algorithms described all use O(N) processors to solve problems of size N, typically in O(log N) time.
I am confident in saying that you do not have a machine with the physical and computational characteristics they go on to describe. And OpenMP, although it can make some use of the SIMD capabilities of the host machine's hardware, fundamentally provides a multithreading approach to parallelism, not the "data parallel" approach on which the algorithms described in the paper are based.
Possibly you could simulate the needed characteristics via OpenMP or another thread-based approach, but you would need to add fairly fine-grained synchronization, which I anticipate would impact performance enough to make the algorithm uninteresting, except possibly for ginormous data sets.
my function is outputting incorrect results
This is unsurprising, because it is full of data races: at each step, one thread may read arr[k - power] either before or after another thread has already overwritten that element, so the result depends on thread scheduling. This is where the need for synchronization that I mentioned above comes in. You can reduce that need by introducing an additional array, the same size as the original one. At each iteration of the outer loop, the inner loop reads from one array and writes to the other, then on the next iteration you flip them. That reduces the need for synchronization to once per outer-loop iteration, which OpenMP will provide for you automatically.
Here's a variation that works:
#include <stddef.h> // size_t
#include <string.h> // memcpy

void prefixSumShared(double *arr, double *scratch, size_t size, int numThreads) {
    if (size < 2) {
        // nothing to do
        return;
    }
    double *src = arr;
    double *dest = scratch;
    size_t step;

    dest[0] = src[0];
    for (step = 1; step < size; step <<= 1) {
        // Because we're flipping buffers back and forth, we need to copy
        // elements that were updated by the previous iteration, but otherwise
        // would not be by this iteration:
        memcpy(dest + (step >> 1), src + (step >> 1), (step >> 1) * sizeof *arr);
        #pragma omp parallel for num_threads(numThreads)
        for (size_t k = step; k < size; k++) {
            dest[k] = src[k] + src[k - step];
        }
        // OMP provides an implicit barrier here
        scratch = src;
        src = dest;
        dest = scratch;
    }
    if (src != arr) {
        // Copy elements updated in the last iteration
        step >>= 1;
        memcpy(arr + step, src + step, (size - step) * sizeof *arr);
    }
}
It relies on the caller to provide appropriate scratch space, at least equal in size to the input array. It uses the buffer-flipping approach I described, which makes it a bit more complex than it might otherwise be, but saves some copying. An alternative would be to always sum from arr into scratch, and then copy back the changes on each iteration, but that would involve asymptotically more copying (O(n log n) instead of O(n)).
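For comparison, that simpler alternative would look roughly like this (my own sketch of the idea, and the function name is mine, not tested production code):

// Sketch of the simpler copy-back variant described above. It is correct,
// but copies O(n) elements on every step, for O(n log n) copying overall.
void prefixSumCopyBack(double *arr, double *scratch, size_t size, int numThreads) {
    for (size_t step = 1; step < size; step <<= 1) {
        #pragma omp parallel for num_threads(numThreads)
        for (size_t k = step; k < size; k++) {
            scratch[k] = arr[k] + arr[k - step];
        }
        // No race: arr is only read above, and only written here, after the
        // implicit barrier at the end of the parallel for.
        memcpy(arr + step, scratch + step, (size - step) * sizeof *arr);
    }
}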
In the working version above, I have also removed the do-nothing iterations of the inner loop, compared with your original. They make sense for the computing environment assumed by Hillis & Steele, but not for OpenMP. And I have restructured a bit to remove the need for floating-point operations (the ceil(log2(size)) computation) in the loop control.
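To test it with your known values, a minimal driver might look like this (my own example, not part of the function above):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double arr[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    size_t size = sizeof arr / sizeof arr[0];
    // Caller-provided scratch space, at least as large as the input
    double *scratch = malloc(size * sizeof *scratch);
    if (scratch == NULL) {
        return EXIT_FAILURE;
    }
    prefixSumShared(arr, scratch, size, 2);
    // Expected: 0 1 3 6 10 15 21 28 36 45
    for (size_t i = 0; i < size; i++) {
        printf("%f ", arr[i]);
    }
    putchar('\n');
    free(scratch);
    return EXIT_SUCCESS;
}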
Upvotes: 3