abhishekpurandare1297

Reputation: 37

CUDA Matrix Multiply on Fortran is slower than C

I am performing a basic matrix multiply using CUDA Fortran and CUDA C without any optimizations. Both versions do exactly the same thing, but the execution time of the Fortran version is slower.

C Kernel

#define idx(x,y,z) x*y + z
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global indices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
//Call  
dim3 grid_dim(SIZE/32, SIZE/32, 1);
dim3 blk_dim(32, 32, 1);
matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
cudaDeviceSynchronize();

Fortran Kernel

attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

    integer :: tx, ty, k
    real*8 :: accum
    real*8, dimension(:,:) :: d_mat, d_matT, d_matSym
    
    tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

    if (tx <= SIZE_ .and. ty <=SIZE_) then
        accum = 0.0
        do k=1, SIZE_
            accum = accum + d_mat(tx,k) * d_matT(k,ty)
        end do
        d_matSym(tx,ty) = accum
    end if
end subroutine matrixMultiply

!Call
type(dim3) :: grid_dim, blk_dim
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()

The difference is that C uses a 1D array whereas Fortran uses a 2D array. But that should not be a problem, since underneath the memory will be contiguous.

If it were a memory-access issue, then in both cases the k loop accesses one matrix contiguously while the other access strides by SIZE.

Both produce the same results.

For a 16384 x 16384 matrix: C: 5.4 sec, Fortran: 6.3 sec

GPU: Tesla V100 32GB

I am not sure what I am doing wrong here.

Upvotes: 2

Views: 892

Answers (1)

Robert Crovella

Reputation: 151799

First of all, I suggest that performance questions include complete codes. I generally need to be able to run stuff, and you can save me some typing. Sure, you can leave stuff out. Sure, I can probably figure out what it is. But I'm less likely to help you that way, and I suspect I'm not alone in that view. My advice: Make it easy for others to help you. I've given examples of what would be useful below.

On to the question:

The difference is that C uses a 1D array whereas Fortran uses 2D. But that should not be a problem since underneath the memory will be contiguous.

TL;DR: Your claim ("that should not be a problem") is evidently not supportable. The difference between a 1D allocation and a 2D allocation matters, not only from a storage perspective but also from an index-calculation perspective. If you're sensitive to the length of this answer, skip to note D at the bottom of this post.

Details:

When we have a loop like this:

    for (k=0; k<SIZE; ++k) {
        accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
    }

the indexing work the machine does under the hood matters for performance. If the compiler can discover that the indexing calculation can simply be "incremented" as the loop proceeds, the generated code will be fairly simple. OTOH, if the indexing must be resolved with a full 2D calculation (e.g. row*width + column) recomputed at each loop iteration, the code will be more complicated. An important factor here is also whether the array width is usable/used by the compiler at compile time.

The latter appears to be what is happening here, and I suggest it accounts for the difference in timing. We'll consider two proof points for this.
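As an illustration of that distinction, here is a hand-written sketch (not actual compiler output; the helper names are made up, and SIZE is assumed to be a compile-time constant, as in the CUDA C++ kernel) showing the two forms the inner-product loop can take:

// Sketch only: two equivalent ways to express the inner-product loop.
const int SIZE = 16384;

// Version 1: the index is recomputed from a 2D-style calculation
// (row*width + column) at every iteration. If the width is not known
// at compile time, extra integer instructions appear inside the loop.
__device__ double dot_recomputed(const double *d_mat, const double *d_matT,
                                 int tx, int ty) {
    double accum = 0.0;
    for (int k = 0; k < SIZE; ++k)
        accum += d_mat[tx * SIZE + k] * d_matT[k * SIZE + ty];
    return accum;
}

// Version 2: the same loop with the addressing "strength-reduced" so each
// iteration only increments previously computed addresses. When the width
// is a compile-time constant, the compiler can perform this transformation
// itself and fold the per-iteration offsets into the load instructions.
__device__ double dot_incremented(const double *d_mat, const double *d_matT,
                                  int tx, int ty) {
    double accum = 0.0;
    const double *row = d_mat  + (size_t)tx * SIZE;  // steps by 1 double
    const double *col = d_matT + ty;                 // steps by SIZE doubles
    for (int k = 0; k < SIZE; ++k, ++row, col += SIZE)
        accum += (*row) * (*col);
    return accum;
}

Both versions compute the same dot product; the difference is only in how much per-iteration index arithmetic is left for the generated code to do.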

  1. SASS analysis

If we compare the SASS between realizations, we see some important differences. In order to do this, we'll need full examples:

CUDA Fortran:

$ cat t11.cuf
module mmk
   USE cudafor
   !
   ! Definition of symbols for real types (RP)
   !
   IMPLICIT NONE
   !
   INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6,   37)     ! REAL32
   INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307)     ! REAL64
   INTEGER, PARAMETER :: SIZE_ = 16384
   !

   Contains

        attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

          integer :: tx, ty, k
          REAL(DP) :: accum
          REAL(DP), dimension(:,:) :: d_mat, d_matT, d_matSym

          tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
          ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

          if (tx <= SIZE_ .and. ty <=SIZE_) then
            accum = 0.0
            do k=1, SIZE_
              accum = accum + d_mat(tx,k) * d_matT(k,ty)
            end do
            d_matSym(tx,ty) = accum
          end if
        end subroutine matrixMultiply


end module mmk

PROGRAM Test
   !
   ! This is the main program for Test
   !
   USE cudafor
   USE mmk

   !
   IMPLICIT NONE
   !
   REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:,:)   :: d_mat, d_matT, d_matSym
   !
   INTEGER    :: err, i1, i2
   type(dim3) :: grid_dim, blk_dim
   !
   !
   ! Allocate storage for the arrays
   !
   Allocate(d_mat(SIZE_,SIZE_),d_matT(SIZE_,SIZE_),d_matSym(SIZE_,SIZE_))
   !
   ! invoke the kernel
   !

   !Call
   grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
   blk_dim = dim3(32, 32, 1)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   err = cudaDeviceSynchronize()


   !
   ! Free storage for the arrays
   !
   Deallocate(d_mat,d_matT,d_matSym)
   !
END PROGRAM Test
$ nvfortran t11.cuf -o t11
$ nvprof ./t11
==38508== NVPROF is profiling process 38508, command: ./t11
==38508== Profiling application: ./t11
==38508== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  12.7168s         2  6.35838s  6.35229s  6.36447s  mmk_matrixmultiply_
                    0.00%  9.2480us         6  1.5410us  1.3120us  2.1760us  [CUDA memcpy HtoD]
      API calls:   97.08%  12.7260s         9  1.41400s  10.596us  6.36561s  cudaFree
                    2.85%  374.24ms         9  41.583ms  3.5780us  369.55ms  cudaMalloc
                    0.03%  4.5190ms         4  1.1297ms  1.0551ms  1.2608ms  cuDeviceTotalMem
                    0.02%  2.5060ms       404  6.2020us     111ns  1.2057ms  cuDeviceGetAttribute
                    0.01%  1.1501ms         4  287.52us  33.014us  1.0465ms  cuDeviceGetName
                    0.00%  120.00us         6  20.000us  5.1530us  62.523us  cudaMemcpy
                    0.00%  99.593us         2  49.796us  44.275us  55.318us  cudaLaunchKernel
                    0.00%  43.414us         1  43.414us  43.414us  43.414us  cudaDeviceSynchronize
                    0.00%  11.563us         4  2.8900us  1.4400us  5.5870us  cuDeviceGetPCIBusId
                    0.00%  1.7030us         8     212ns     124ns     580ns  cuDeviceGet
                    0.00%     901ns         3     300ns     163ns     534ns  cuDeviceGetCount
                    0.00%     855ns         4     213ns     183ns     263ns  cuDeviceGetUuid
$

NVIDIA HPC SDK, 2021.2, Tesla V100

We see that the execution duration is about 6.3 seconds, in line with your report.

CUDA C++:

$ cat t12.cu
#define idx(x,y,z) x*y + z
const int SIZE = 16384;
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {

    //Global indices
    int tx = blockIdx.y * blockDim.y + threadIdx.y;
    int ty = blockIdx.x * blockDim.x + threadIdx.x;
    int k;

    if (tx < SIZE && ty < SIZE) {
        double accum = 0.0;
        //Accumulation for (tx,ty) position
        for (k=0; k<SIZE; ++k) {
            accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
        }
        d_matSym[idx(tx,SIZE,ty)] = accum;
    }
}
int main(){
  //Call
  dim3 grid_dim(SIZE/32, SIZE/32, 1);
  dim3 blk_dim(32, 32, 1);
  double *d_mat, *d_matT, *d_matSym;
  cudaMalloc(&d_mat,    SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matT,   SIZE*SIZE*sizeof(double));
  cudaMalloc(&d_matSym, SIZE*SIZE*sizeof(double));
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
  cudaDeviceSynchronize();
}
$ nvcc -o t12 t12.cu -arch=sm_70
$ nvprof ./t12
==40364== NVPROF is profiling process 40364, command: ./t12
==40364== Profiling application: ./t12
==40364== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  11.0379s         2  5.51893s  5.50503s  5.53282s  matrixMultiply(double*, double*, double*)
      API calls:   97.80%  11.0379s         1  11.0379s  11.0379s  11.0379s  cudaDeviceSynchronize
                    2.15%  242.74ms         3  80.914ms  1.9747ms  238.78ms  cudaMalloc
                    0.04%  4.3105ms         4  1.0776ms  1.0728ms  1.0890ms  cuDeviceTotalMem
                    0.01%  1.4584ms       404  3.6090us     110ns  163.21us  cuDeviceGetAttribute
                    0.00%  142.28us         4  35.571us  32.475us  41.980us  cuDeviceGetName
                    0.00%  51.695us         2  25.847us  7.0790us  44.616us  cudaLaunchKernel
                    0.00%  11.198us         4  2.7990us  1.6260us  5.5950us  cuDeviceGetPCIBusId
                    0.00%  1.7930us         3     597ns     149ns  1.2200us  cuDeviceGetCount
                    0.00%  1.6590us         8     207ns     120ns     704ns  cuDeviceGet
                    0.00%     733ns         4     183ns     161ns     215ns  cuDeviceGetUuid
$

CUDA 11.2, same V100 machine

Again we see a ~5.5 sec kernel duration, lining up with your report. Now let's compare the SASS. We use the cuobjdump utility to extract it, and in both cases it is evident that the compiler is unrolling the kernel loop in some fashion. The key operation is DFMA (Double-precision Fused Multiply-Add), which is issued once per loop iteration to compute a single product (and keep a running sum). Therefore, I will extract just a short sequence of DFMA operations in each case, from the midst of the unrolled loop, to get an idea of the per-FLOP "cost" and "overhead".
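For reference, the SASS below can be extracted from the binaries built above with the cuobjdump tool; the exact listing will vary with compiler version and target architecture.

$ cuobjdump -sass ./t11 > t11.sass
$ cuobjdump -sass ./t12 > t12.sass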

CUDA Fortran:

/*1670*/                   LDG.E.64.SYS R18, [R12+0x1b8] ;                 /* 0x0001b8000c127381 */
                                                                           /* 0x000ea200001eeb00 */
/*1680*/                   IADD3 R20, P0, R16, R29, RZ ;                   /* 0x0000001d10147210 */
                                                                           /* 0x001fc60007f1e0ff */
/*1690*/                   LDG.E.64.SYS R22, [R12+0x1b0] ;                 /* 0x0001b0000c167381 */
                                                                           /* 0x000e6400001eeb00 */
/*16a0*/                   IMAD.X R21, R17, 0x1, R27, P0 ;                 /* 0x0000000111157824 */
                                                                           /* 0x000fe400000e061b */
/*16b0*/                   LDG.E.64.SYS R24, [R16] ;                       /* 0x0000000010187381 */
                                                                           /* 0x00006c00001eeb00 */
/*16c0*/                   LDG.E.64.SYS R16, [R20] ;                       /* 0x0000000014107381 */
                                                                           /* 0x001ea200001eeb00 */
/*16d0*/                   DFMA R22, R24, R22, R14 ;                       /* 0x000000161816722b */
                                                                           /* 0x002084000000000e */
/*16e0*/                   IADD3 R14, P0, R20, R29, RZ ;                   /* 0x0000001d140e7210 */
                                                                           /* 0x001fca0007f1e0ff */
/*16f0*/                   IMAD.X R15, R21, 0x1, R27, P0 ;                 /* 0x00000001150f7824 */
                                                                           /* 0x000fe200000e061b */
/*1700*/                   DFMA R16, R16, R18, R22 ;                       /* 0x000000121010722b */

The above sequence repeats within the unrolled loop region. We note that there are 2 DFMA operations and 4 LDG operations (which makes sense: two loads per multiply), and the remaining 4 instructions presumably constitute indexing updates as the loop proceeds through its iterations.

CUDA C++:

/*02d0*/                   LDG.E.64.SYS R16, [R6+0x40000] ;         /* 0x0400000006107381 */
                                                                    /* 0x001f2800001eeb00 */
/*02e0*/                   LDG.E.64.SYS R24, [R4+0x10] ;            /* 0x0000100004187381 */
                                                                    /* 0x000f2200001eeb00 */
/*02f0*/                   DFMA R10, R20, R10, R22 ;                /* 0x0000000a140a722b */
                                                                    /* 0x0200860000000016 */
/*0300*/                   LDG.E.64.SYS R22, [R6+0x60000] ;         /* 0x0600000006167381 */
                                                                    /* 0x001f6800001eeb00 */
/*0310*/                   LDG.E.64.SYS R20, [R4+0x18] ;            /* 0x0000180004147381 */
                                                                    /* 0x000f6200001eeb00 */
/*0320*/                   DFMA R14, R8, R14, R10 ;                 /* 0x0000000e080e722b */

Again, that sequence repeats. We see 2 DFMA and 4 LDG instructions, but no other instructions in the repeating sequence. We note that the LDG instructions have offsets that can be incorporated into the instruction at compile time, and these offsets eliminate the need for any extra instructions, because they merely "increment" previously computed addresses. For example, 0x40000 and 0x60000 are multiples of 0x20000 = 16384 elements x 8 bytes, i.e. a stride of SIZE doubles, while 0x10 and 0x18 are adjacent doubles, 8 bytes apart. In other words, the "increments" are column-wise offsets in one case and row-wise offsets in the other, just as we would expect for picking up the two multiplication operands in the kernel loop.

  2. Refactoring

Can we refactor the Fortran code to make it work like the C++ code? Yes:

$ cat t11a.cuf
module mmk
   USE cudafor
   !
   ! Definition of symbols for real types (RP)
   !
   IMPLICIT NONE
   !
   INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6,   37)     ! REAL32
   INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307)     ! REAL64
   INTEGER, PARAMETER :: SIZE_ = 16384
   !

   Contains

        attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)

          integer :: tx, ty, k
          REAL(DP) :: accum
          REAL(DP), dimension(:) :: d_mat, d_matT, d_matSym

          tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
          ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y

          if (tx <= SIZE_ .and. ty <=SIZE_) then
            accum = 0.0
            do k=1, SIZE_
              accum = accum + d_mat((ty-1)*SIZE_+k) * d_matT((k-1)*SIZE_+tx)
            end do
            d_matSym((ty-1)*SIZE_+tx) = accum
          end if
        end subroutine matrixMultiply


end module mmk

PROGRAM Test
   !
   ! This is the main program for Test
   !
   USE cudafor
   USE mmk

   !
   IMPLICIT NONE
   !
   REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:)     :: d_mat, d_matT, d_matSym
   !
   INTEGER    :: err, i1, i2
   type(dim3) :: grid_dim, blk_dim
   !
   ! Allocate storage for the arrays
   !
   Allocate(d_mat(SIZE_*SIZE_),d_matT(SIZE_*SIZE_),d_matSym(SIZE_*SIZE_))
   !
   ! invoke the kernel
   !

   !Call
   grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
   blk_dim = dim3(32, 32, 1)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
   err = cudaDeviceSynchronize()


   !
   ! Free storage for the arrays
   !
   Deallocate(d_mat,d_matT,d_matSym)
   !
END PROGRAM Test
$ nvfortran t11a.cuf -o t11a
$ nvprof ./t11a
==45544== NVPROF is profiling process 45544, command: ./t11a
==45544== Profiling application: ./t11a
==45544== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  10.8169s         2  5.40847s  5.39118s  5.42576s  mmk_matrixmultiply_
                    0.00%  9.2160us         6  1.5360us  1.3120us  2.1440us  [CUDA memcpy HtoD]
      API calls:   96.72%  10.8240s         9  1.20266s  12.102us  5.42594s  cudaFree
                    3.22%  360.34ms         9  40.038ms  3.2760us  355.64ms  cudaMalloc
                    0.04%  4.3488ms         4  1.0872ms  1.0598ms  1.1593ms  cuDeviceTotalMem
                    0.02%  2.3633ms       404  5.8490us     111ns  869.22us  cuDeviceGetAttribute
                    0.00%  144.50us         4  36.125us  33.254us  41.317us  cuDeviceGetName
                    0.00%  106.43us         6  17.737us  4.7910us  51.453us  cudaMemcpy
                    0.00%  101.46us         2  50.732us  44.479us  56.985us  cudaLaunchKernel
                    0.00%  20.926us         1  20.926us  20.926us  20.926us  cudaDeviceSynchronize
                    0.00%  7.2850us         4  1.8210us     491ns  4.6210us  cuDeviceGetPCIBusId
                    0.00%  1.7650us         8     220ns     118ns     672ns  cuDeviceGet
                    0.00%     926ns         4     231ns     218ns     264ns  cuDeviceGetUuid
                    0.00%     614ns         3     204ns     130ns     310ns  cuDeviceGetCount
$

And we see that the performance is approximately the same as the CUDA C++ version.

Notes:

A. I've omitted the SASS analysis for the final variant; that exercise is left to the reader. However, the unrolled loop portion is a repeating sequence of LDG-LDG-DFMA, just as witnessed in the CUDA C++ version.

B. You might have two questions. 1. Why is it this way? I've tried to answer that. 2. Should it be this way / does it need to be this way? I have not answered that; I believe it is partly a matter of Fortran language expectations that I don't wish to debate, as well as a question for the compiler engineers. If you think the 2D realization should match the performance of the 1D realization, I suggest filing a bug.

C. It took a little while of head-scratching before I noticed that you had done this in the C++ case but not in the Fortran case:

int tx = blockIdx.y * blockDim.y + threadIdx.y;
     ^            ^            ^             ^

D. One of the reasons I think this discrepancy is not trivially resolvable at compile time is that the compiler's 2D array handling in CUDA Fortran (whether needed or not) seems to involve the creation of a run-time metadata package which includes the array width, perhaps amongst other information. This metadata package is transmitted to the device separately from the array data payload; this much can be surmised from the profiler output above, or by running with --print-gpu-trace to see it more clearly. In your CUDA C++ case, the array width (SIZE) is effectively known at compile time. In the Fortran case, it is not (or, if it is, the compiler does not seem to be propagating this knowledge through the kernel code generation). You can see that the "improved" addressing mode above relies on offsets that are known at compile time. Whether or not this "should" be resolvable with some more careful treatment of the Fortran I cannot say. However, I would expect that some sort of contiguous declaration/decoration by itself is not sufficient to resolve this discrepancy for the compiler; that declaration alone would not obviate the need for on-the-fly runtime indexing calculations based on the array width.
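For example, the handful of small [CUDA memcpy HtoD] operations in the Fortran profiler summaries above, even though the program performs no explicit host-to-device copies, are consistent with such a metadata transfer. To see each transfer and kernel launch individually, the run can be repeated as:

$ nvprof --print-gpu-trace ./t11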

Upvotes: 4
