Sequential dot_product in OpenACC Fortran loop

Question

In a Fortran program, I have a large loop with several dot_product calls on small vectors generated within the loop:

program test
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: subarray1(2), subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels
        !$acc loop independent private(subarray1, subarray2)
        do i = 1, 2
                subarray1(:) = array1(:, i)
                subarray2(:) = array2(:, i)
                res(i) = dot_product(subarray1, subarray2)
        enddo
        !$acc end kernels
        !$acc end data

        print "(2(g0, 1x))", res
endprogram

When compiled with the PGI compiler, the accelerated implementation of dot_product appears to use parallel loops of its own, which prevents the main loop from being scheduled across both gang and vector:

test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     14, Loop is parallelizable
         Generating Tesla code
         14, !$acc loop gang ! blockidx%x
         15, !$acc loop vector(32) ! threadidx%x
         17, !$acc loop vector(32) ! threadidx%x
             Generating implicit reduction(+:subarray1$r)
     14, CUDA shared memory used for subarray2,subarray1
     15, Loop is parallelizable
     17, Loop is parallelizable

As the log shows, the compiler generates an implicit reduction and places the loop-private vectors in CUDA shared memory.

Is there a way to force dot_product to run sequentially?

Mat Colgrove · Accepted Answer

Is there a way to force dot_product to run sequentially?

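As long as you don't mind the array-syntax assignments running sequentially as well, just add "gang vector" to the loop directive. With both levels of parallelism assigned to the outer loop, the compiler has none left for the inner loops and schedules them as "loop seq", as the -Minfo output after the code shows.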

% cat test.f90
program test
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: subarray1(2), subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels loop gang vector private(subarray1, subarray2)
        do i = 1, 2
                subarray1(:) = array1(:, i)
                subarray2(:) = array2(:, i)
                res(i) = dot_product(subarray1, subarray2)
        enddo
        !$acc end data

        print "(2(g0, 1x))", res
endprogram
% nvfortran -acc -Minfo=accel test.f90
test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     13, Loop is parallelizable
         Generating Tesla code
         13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         14, !$acc loop seq
         16, !$acc loop seq
     13, Local memory used for subarray2,subarray1
     14, Loop is parallelizable
     16, Loop is parallelizable
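
If the private temporaries are not needed elsewhere, another way to keep the dot product sequential is to write it as an explicit inner loop marked "loop seq". The following is only a minimal sketch under some assumptions, not something taken from the answer above: the program name test_seq, the scalar accumulator dot, and the inner index j are hypothetical names, and it uses "parallel loop" instead of "kernels".

```fortran
program test_seq
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: dot            ! hypothetical per-iteration accumulator
        integer :: i, j

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        ! The outer loop takes both gang and vector parallelism;
        ! each iteration gets its own private copy of dot.
        !$acc parallel loop gang vector private(dot)
        do i = 1, 2
                dot = 0.0
                ! The inner loop is explicitly sequential, so each thread
                ! computes one complete dot product on its own.
                !$acc loop seq
                do j = 1, 2
                        dot = dot + array1(j, i) * array2(j, i)
                enddo
                res(i) = dot
        enddo
        !$acc end data

        print "(2(g0, 1x))", res
end program test_seq
```

Both entries of res should equal 4, the same value the dot_product version computes. This variant also avoids the subarray1 and subarray2 copies, so only a scalar needs to be private. Compiling it with the same nvfortran -acc -Minfo=accel command should show how the loops were actually scheduled.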
