OpenACC sum reduction output increments at each execution

Question

Why does the following code:

#include <iostream>

int main(int argc, char const *argv[])
{
    int sum = 0;
    int *array;
    array = new int [100];

    #pragma acc enter data create(array[0:100],sum)

    #pragma acc parallel loop present(array[0:100])
    for (int i = 0; i < 100; ++i)
    {
        array[i] = 1;
    }

    #pragma acc parallel loop present(array[0:100],sum) reduction(+:sum)
    for (int i = 0; i < 100; ++i)
    {
        sum += array[i];
    }

    #pragma acc exit data delete(array[0:100]) copyout(sum)

    std::cout << sum << std::endl;

    return 0;
}

return a different result at every execution?

$ pgcpp -acc -Minfo main.cpp
main:
      7, Generating enter data create(sum)
         Generating enter data create(array[:100])
         Generating present(array[:100])
         Accelerator kernel generated
         12, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
      7, Generating Tesla code
     15, Generating present(array[:100])
         Generating present(sum)
         Accelerator kernel generated
         18, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
         20, Sum reduction generated for sum
     15, Generating Tesla code
     25, Generating exit data copyout(sum)
         Generating exit data delete(array[:100])
$ ./a.out
100
$ ./a.out
200
$ ./a.out
300
$ ./a.out
400

According to the OpenACC standard:

On an exit data directive, the data is copied back to the local memory and deallocated.

It would seem that sum is not deallocated and instead re-used (and incremented) at every run of the program. Furthermore, the + operator in the reduction directive initializes the reduction variable to 0, so this shouldn't happen even if sum weren't deallocated between executions.

I can avoid this behavior by either using copyin instead of create for sum in the enter data directive, or setting sum = 0 in a single gang, single worker kernel:

#pragma acc parallel present(sum) num_gangs(1) num_workers(1)
sum = 0;

But neither is satisfactory: the first requires a costly host-to-device data copy, and the second requires a kernel launch. Why is my program behaving this way?

Robert Crovella · Accepted Answer

You are misinterpreting the meaning of the reduction operator initialization value. Referring to the OpenACC specification, pp. 20-21:

The reduction clause is allowed on the parallel construct. It specifies a reduction operator and one or more scalar variables. For each variable, a private copy is created for each parallel gang and initialized for that operator. At the end of the region, the values for each gang are combined using the reduction operator, and the result combined with the value of the original variable and stored in the original variable.

This means that the overall reduction problem is broken down into pieces, each piece handled by a gang. The part of the problem handled by each gang uses the indicated initialization value for its private copy of the reduction variable. However, when the final result is created, the individual results from each gang are combined with the value of the original variable (sum in your case), and that is the result.
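The combining step described above can be modeled on the host without any OpenACC runtime. This is only a sketch of the semantics, not how a compiler implements the reduction: the function name model_sum_reduction, the num_gangs parameter, and the round-robin split of the work are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of reduction(+:sum) semantics (hypothetical helper,
// not part of OpenACC). The gang count and the round-robin work split
// are illustrative assumptions.
int model_sum_reduction(int original, const std::vector<int>& data,
                        std::size_t num_gangs)
{
    assert(num_gangs > 0);

    // Each gang gets a private copy, initialized to 0 for the '+' operator.
    std::vector<int> partial(num_gangs, 0);

    // Each gang accumulates its share of the elements.
    for (std::size_t i = 0; i < data.size(); ++i)
        partial[i % num_gangs] += data[i];

    // The per-gang results are combined with the reduction operator...
    int combined = 0;
    for (int p : partial)
        combined += p;

    // ...and then combined with the value of the original variable.
    return original + combined;
}
```

With an original value of 0 and 100 elements equal to 1, the model returns 100. If the original value is a stale 100, it returns 200, which matches the pattern in the question under the assumption that the device copy of sum still holds the previous run's result.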

So you must properly initialize sum, perhaps using one of the methods you outline in your question.
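As one possible way to do that, here is a sketch of the program from the question using the copyin approach the question already mentions, so the device copy of sum starts at the host value 0. The wrapper function sum_of_ones and its parameter n are hypothetical names introduced only to make the snippet easy to call; the directives themselves are the ones from the question, with create(sum) changed to copyin(sum).

```cpp
#include <cassert>

// Sketch of the program from the question with the device copy of sum
// initialized. sum_of_ones and n are hypothetical names for illustration.
int sum_of_ones(int n)
{
    assert(n > 0);

    int sum = 0;
    int *array = new int[n];

    // copyin(sum) transfers the host value 0 to the device;
    // create(sum) would leave the device copy uninitialized.
    #pragma acc enter data create(array[0:n]) copyin(sum)

    #pragma acc parallel loop present(array[0:n])
    for (int i = 0; i < n; ++i)
    {
        array[i] = 1;
    }

    // The per-gang partial sums are combined with the device value of sum,
    // which is now 0 rather than whatever was left in that memory.
    #pragma acc parallel loop present(array[0:n], sum) reduction(+:sum)
    for (int i = 0; i < n; ++i)
    {
        sum += array[i];
    }

    #pragma acc exit data delete(array[0:n]) copyout(sum)

    delete[] array;
    return sum;
}
```

Calling sum_of_ones(100) should give 100 on every run. Built without OpenACC support, the pragmas are ignored and the loops run on the host with the same result.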

Also, although it's not the crux of the issue, note that neither deallocation nor allocation has any effect on the contents of memory. A new variable allocated in that spot, without proper initialization, will pick up the value currently in that location.
