Neraste

Reputation: 525

Sequential dot_product in OpenACC Fortran loop

In a Fortran program, I have a large loop with several dot_product calls on small vectors generated within the loop:

program test
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: subarray1(2), subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels
        !$acc loop independent private(subarray1, subarray2)
        do i = 1, 2
                subarray1(:) = array1(:, i)
                subarray2(:) = array2(:, i)
                res(i) = dot_product(subarray1, subarray2)
        enddo
        !$acc end kernels
        !$acc end data

        print "(2(g0, x))", res
endprogram

When compiled with the PGI compiler, the accelerated implementation of dot_product appears to use parallel loops of its own, which prevents the main loop from being parallelized across both gang and vector:

test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     14, Loop is parallelizable
         Generating Tesla code
         14, !$acc loop gang ! blockidx%x
         15, !$acc loop vector(32) ! threadidx%x
         17, !$acc loop vector(32) ! threadidx%x
             Generating implicit reduction(+:subarray1$r)
     14, CUDA shared memory used for subarray2,subarray1
     15, Loop is parallelizable
     17, Loop is parallelizable

As the log shows, the compiler generates an implicit reduction and places the loop-private vectors in CUDA shared memory.

Is there a way to force dot_product to run sequentially?

Upvotes: 0

Views: 169

Answers (1)

Mat Colgrove

Reputation: 5646

Is there a way to force dot_product to run sequentially?

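As long as you don't mind the array-syntax assignments also running sequentially, just add "gang vector" to the loop directive. With both levels of parallelism bound to the outer loop, the compiler schedules the inner loops (the array assignments and dot_product) as "loop seq", as the -Minfo output below shows.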

% cat test.f90
program test
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: subarray1(2), subarray2(2)
        integer :: i

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels loop gang vector private(subarray1, subarray2)
        do i = 1, 2
                subarray1(:) = array1(:, i)
                subarray2(:) = array2(:, i)
                res(i) = dot_product(subarray1, subarray2)
        enddo
        !$acc end data

        print "(2(g0, x))", res
endprogram
% nvfortran -acc -Minfo=accel test.f90
test:
     11, Generating copyin(array1(:,:)) [if not already present]
         Generating copyout(res(:)) [if not already present]
         Generating copyin(array2(:,:)) [if not already present]
     13, Loop is parallelizable
         Generating Tesla code
         13, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         14, !$acc loop seq
         16, !$acc loop seq
     13, Local memory used for subarray2,subarray1
     14, Loop is parallelizable
     16, Loop is parallelizable
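
If you would rather state the sequential intent explicitly than rely on how the compiler schedules the intrinsic, one option is to write the dot product by hand and mark the inner loop with "loop seq". This is only a sketch and not something I compiled for this answer: the program name test_seq and the scalar total are made up for illustration, and it drops the private temporaries by reading array1 and array2 directly.

```fortran
program test_seq
        implicit none

        real :: array1(2, 2), array2(2, 2), res(2)
        real :: total   ! hypothetical accumulator, not in the original code
        integer :: i, j

        array1 = 1
        array2 = 2

        !$acc data copyin(array1, array2) copyout(res)
        !$acc kernels loop gang vector private(total)
        do i = 1, 2
                total = 0.0
                ! Hand-written dot product; "seq" asks for this loop to run
                ! sequentially within each iteration of the outer loop.
                !$acc loop seq
                do j = 1, 2
                        total = total + array1(j, i) * array2(j, i)
                enddo
                res(i) = total
        enddo
        !$acc end data

        print "(2(g0, 1x))", res
endprogram
```

With the inputs above, each element of res should equal 4 (1*2 + 1*2), the same as the dot_product version. Whether this is any faster than the intrinsic would have to be measured on the real problem size.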

Upvotes: 1
