Reputation: 959
Why does the following code:
#include <iostream>

int main(int argc, char const *argv[])
{
    int sum = 0;
    int *array;
    array = new int [100];

    #pragma acc enter data create(array[0:100],sum)

    #pragma acc parallel loop present(array[0:100])
    for (int i = 0; i < 100; ++i)
    {
        array[i] = 1;
    }

    #pragma acc parallel loop present(array[0:100],sum) reduction(+:sum)
    for (int i = 0; i < 100; ++i)
    {
        sum += array[i];
    }

    #pragma acc exit data delete(array[0:100]) copyout(sum)

    std::cout << sum << std::endl;
    return 0;
}
return a different result on every execution?
$ pgcpp -acc -Minfo main.cpp
main:
7, Generating enter data create(sum)
Generating enter data create(array[:100])
Generating present(array[:100])
Accelerator kernel generated
12, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
7, Generating Tesla code
15, Generating present(array[:100])
Generating present(sum)
Accelerator kernel generated
18, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
20, Sum reduction generated for sum
15, Generating Tesla code
25, Generating exit data copyout(sum)
Generating exit data delete(array[:100])
$ ./a.out
100
$ ./a.out
200
$ ./a.out
300
$ ./a.out
400
According to the OpenACC standard:
On an exit data directive, the data is copied back to the local memory and deallocated.
It would seem that sum is not deallocated and is instead reused (and incremented) at every run of the program. Furthermore, the + operator in the reduction directive initializes the reduction variable to 0, so this shouldn't happen even if sum weren't deallocated between executions.
I can avoid this behavior by either using copyin instead of create for sum in the enter data directive, or setting sum = 0 in a single-gang, single-worker kernel:
#pragma acc parallel present(sum) num_gangs(1) num_workers(1)
sum = 0;
But neither option is satisfactory: the first requires a costly host-to-device data copy, and the second an extra kernel launch. Why is my program behaving this way?
Upvotes: 1
Views: 3606
Reputation: 152143
You are misinterpreting the meaning of the reduction operator initialization value. Referring to the OpenACC specification, pp. 20-21:
The reduction clause is allowed on the parallel construct. It specifies a reduction operator and one or more scalar variables. For each variable, a private copy is created for each parallel gang and initialized for that operator. At the end of the region, the values for each gang are combined using the reduction operator, and the result combined with the value of the original variable and stored in the original variable.
This means that the overall reduction problem is broken down into pieces, each piece handled by a gang. The part of the problem handled by each gang will use the indicated initialization value for the reduction variable. However, when the final result is created, the individual results from each gang will be combined with the value of the original variable (sum in your case), and that will be the result.
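The combination rule described above can be modeled on the host without any OpenACC runtime. The following is only a rough sketch: the function name simulate_sum_reduction, the num_gangs parameter, and the strided split of the work are illustrative assumptions, not what a real compiler or device does.

```cpp
#include <cstddef>
#include <vector>

// Host-only model of reduction(+:sum). Hypothetical helper, not an OpenACC API.
// 'original' stands for the value the reduction variable holds before the region.
int simulate_sum_reduction(const std::vector<int> &data, int num_gangs, int original)
{
    // One private copy per gang, initialized to 0 (the identity for '+').
    std::vector<int> partial(num_gangs, 0);

    // Each gang accumulates its share of the elements (strided split, assumed).
    for (int g = 0; g < num_gangs; ++g)
    {
        for (std::size_t i = static_cast<std::size_t>(g); i < data.size();
             i += static_cast<std::size_t>(num_gangs))
        {
            partial[g] += data[i];
        }
    }

    // The gang results are combined with the ORIGINAL value, not with 0.
    int result = original;
    for (int g = 0; g < num_gangs; ++g)
    {
        result += partial[g];
    }
    return result;
}
```

For 100 ones, an original value of 0 gives 100, while an original value of 100 gives 200. That is consistent with the output in the question if the uninitialized device copy of sum happens to hold the result of the previous run.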
So you must properly initialize sum, perhaps using one of the methods you outline in your question.
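As a minimal sketch of the copyin approach from the question, the program could be arranged as below. The wrapper function sum_of_ones and its parameter n are hypothetical names introduced here for illustration; the directives themselves are the ones from the question, with sum moved from create to copyin.

```cpp
// Hypothetical helper (not in the original post): fills an n-element array
// with ones on the device and returns their sum.
int sum_of_ones(int n)
{
    int sum = 0;                 // defined value on the host
    int *array = new int[n];

    // copyin(sum) transfers the host value 0 to the device copy, so the
    // reduction result is combined with a known starting value.
    #pragma acc enter data create(array[0:n]) copyin(sum)

    #pragma acc parallel loop present(array[0:n])
    for (int i = 0; i < n; ++i)
    {
        array[i] = 1;
    }

    #pragma acc parallel loop present(array[0:n], sum) reduction(+:sum)
    for (int i = 0; i < n; ++i)
    {
        sum += array[i];
    }

    #pragma acc exit data delete(array[0:n]) copyout(sum)

    delete[] array;              // release the host allocation
    return sum;
}
```

Calling sum_of_ones(100) from main and printing the result should then give 100 on every run. A compiler without OpenACC support ignores the pragmas and runs the loops on the host, with the same result.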
Also, although it's not the crux of the issue, note that neither deallocation nor allocation has any effect on the contents of memory. A new variable allocated in that spot, without proper initialization, will pick up the value currently in that location.
Upvotes: 2