Reputation: 37
I am performing a basic matrix multiply using CUDA Fortran and CUDA C without any optimizations. Both versions do exactly the same thing, but the execution time for Fortran is slower.
C Kernel
#define idx(x,y,z) x*y + z
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {
//Global indices
int tx = blockIdx.y * blockDim.y + threadIdx.y;
int ty = blockIdx.x * blockDim.x + threadIdx.x;
int k;
if (tx < SIZE && ty < SIZE) {
double accum = 0.0;
//Accumulation for (tx,ty) position
for (k=0; k<SIZE; ++k) {
accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
}
d_matSym[idx(tx,SIZE,ty)] = accum;
}
}
//Call
dim3 grid_dim(SIZE/32, SIZE/32, 1);
dim3 blk_dim(32, 32, 1);
matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
cudaDeviceSynchronize();
Fortran Kernel
attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)
integer :: tx, ty, k
real*8 :: accum
real*8, dimension(:,:) :: d_mat, d_matT, d_matSym
tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y
if (tx <= SIZE_ .and. ty <=SIZE_) then
accum = 0.0
do k=1, SIZE_
accum = accum + d_mat(tx,k) * d_matT(k,ty)
end do
d_matSym(tx,ty) = accum
end if
end subroutine matrixMultiply
!Call
type(dim3) :: grid_dim, blk_dim
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()
The difference is that C uses a 1D array whereas Fortran uses 2D. But that should not be a problem since underneath the memory will be contiguous.
If it were a memory-access issue, then in both cases the k loop accesses one matrix contiguously while the access to the other jumps by SIZE.
Both produce the same results.
For a 16384 x 16384 matrix, C takes 5.4 sec and Fortran takes 6.3 sec.
GPU: Tesla V100 32GB
I am not sure what I am doing wrong here.
Upvotes: 2
Views: 892
Reputation: 151799
First of all, I suggest that performance questions include complete codes. I generally need to be able to run stuff, and you can save me some typing. Sure, you can leave stuff out. Sure, I can probably figure out what it is. But I'm less likely to help you that way, and I suspect I'm not alone in that view. My advice: Make it easy for others to help you. I've given examples of what would be useful below.
On to the question:
The difference is that C uses a 1D array whereas Fortran uses 2D. But that should not be a problem since underneath the memory will be contiguous.
TL;DR: Your claim ("that should not be a problem") is evidently not supportable. The difference between a 1D allocation and a 2D allocation matters, not only from a storage perspective but also from an index-calculation perspective. If you're sensitive to the length of this answer, skip to note D at the bottom of this post.
Details:
When we have a loop like this:
for (k=0; k<SIZE; ++k) {
accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
}
If the compiler can discover that the indexing calculation(s) can simply be "incremented" as the loop proceeds, this will be fairly simple. OTOH, if the indexing must be resolved with a full 2D calculation (e.g. row*width + column) at each loop iteration, this will be more complicated. An important factor here is also whether the array width is usable by (and used by) the compiler at compile time.
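As a rough illustration of the simpler case (a sketch of my own, not the OP's code and not literal compiler output; the name matrixMultiplySketch and the pointers pA/pB are made up, and SIZE is assumed to be a compile-time constant as in the question's code), the C++ kernel's loop can be viewed as walking two fixed-stride pointers. Once SIZE is known at compile time and the loop is unrolled, each load offset becomes a literal constant baked into the instruction, which is what we'll see later in the SASS:
__global__ void matrixMultiplySketch(const double *d_mat, const double *d_matT, double *d_matSym) {
    int tx = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int ty = blockIdx.x * blockDim.x + threadIdx.x;   // column
    if (tx < SIZE && ty < SIZE) {
        const double *pA = d_mat + (size_t)tx * SIZE;   // walks a row: +1 element per k
        const double *pB = d_matT + ty;                 // walks a column: +SIZE elements per k
        double accum = 0.0;
        for (int k = 0; k < SIZE; ++k) {
            // after unrolling, k and k*SIZE are literal, compile-time offsets
            accum += pA[k] * pB[(size_t)k * SIZE];
        }
        d_matSym[(size_t)tx * SIZE + ty] = accum;       // same result as the original kernel
    }
}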
The more complicated, recomputed-index case appears to be what is happening in the Fortran version, and I suggest that is the difference in timing. We'll consider 2 proof points for this.
If we compare the SASS between realizations, we see some important differences. In order to do this, we'll need full examples:
CUDA Fortran:
$ cat t11.cuf
module mmk
USE cudafor
!
! Definition of symbols for real types (RP)
!
IMPLICIT NONE
!
INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6, 37) ! REAL32
INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307) ! REAL64
INTEGER, PARAMETER :: SIZE_ = 16384
!
Contains
attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)
integer :: tx, ty, k
REAL(DP) :: accum
REAL(DP), dimension(:,:) :: d_mat, d_matT, d_matSym
tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y
if (tx <= SIZE_ .and. ty <=SIZE_) then
accum = 0.0
do k=1, SIZE_
accum = accum + d_mat(tx,k) * d_matT(k,ty)
end do
d_matSym(tx,ty) = accum
end if
end subroutine matrixMultiply
end module mmk
PROGRAM Test
!
! This is the main program for Test
!
USE cudafor
USE mmk
!
IMPLICIT NONE
!
REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:,:) :: d_mat, d_matT, d_matSym
!
INTEGER :: err, i1, i2
type(dim3) :: grid_dim, blk_dim
!
!
! Allocate storage for the arrays
!
Allocate(d_mat(SIZE_,SIZE_),d_matT(SIZE_,SIZE_),d_matSym(SIZE_,SIZE_))
!
! invoke the kernel
!
!Call
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()
!
! Free storage for the arrays
!
Deallocate(d_mat,d_matT,d_matSym)
!
END PROGRAM Test
$ nvfortran t11.cuf -o t11
$ nvprof ./t11
==38508== NVPROF is profiling process 38508, command: ./t11
==38508== Profiling application: ./t11
==38508== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 12.7168s 2 6.35838s 6.35229s 6.36447s mmk_matrixmultiply_
0.00% 9.2480us 6 1.5410us 1.3120us 2.1760us [CUDA memcpy HtoD]
API calls: 97.08% 12.7260s 9 1.41400s 10.596us 6.36561s cudaFree
2.85% 374.24ms 9 41.583ms 3.5780us 369.55ms cudaMalloc
0.03% 4.5190ms 4 1.1297ms 1.0551ms 1.2608ms cuDeviceTotalMem
0.02% 2.5060ms 404 6.2020us 111ns 1.2057ms cuDeviceGetAttribute
0.01% 1.1501ms 4 287.52us 33.014us 1.0465ms cuDeviceGetName
0.00% 120.00us 6 20.000us 5.1530us 62.523us cudaMemcpy
0.00% 99.593us 2 49.796us 44.275us 55.318us cudaLaunchKernel
0.00% 43.414us 1 43.414us 43.414us 43.414us cudaDeviceSynchronize
0.00% 11.563us 4 2.8900us 1.4400us 5.5870us cuDeviceGetPCIBusId
0.00% 1.7030us 8 212ns 124ns 580ns cuDeviceGet
0.00% 901ns 3 300ns 163ns 534ns cuDeviceGetCount
0.00% 855ns 4 213ns 183ns 263ns cuDeviceGetUuid
$
NVIDIA HPC SDK, 2021.2, Tesla V100
We see that the execution duration is about 6.3 seconds, in line with your report.
CUDA C++:
$ cat t12.cu
#define idx(x,y,z) x*y + z
const int SIZE = 16384;
__global__ void matrixMultiply(double *d_mat, double *d_matT, double *d_matSym) {
//Global indices
int tx = blockIdx.y * blockDim.y + threadIdx.y;
int ty = blockIdx.x * blockDim.x + threadIdx.x;
int k;
if (tx < SIZE && ty < SIZE) {
double accum = 0.0;
//Accumulation for (tx,ty) position
for (k=0; k<SIZE; ++k) {
accum += d_mat[idx(tx,SIZE,k)] * d_matT[idx(k,SIZE,ty)];
}
d_matSym[idx(tx,SIZE,ty)] = accum;
}
}
int main(){
//Call
dim3 grid_dim(SIZE/32, SIZE/32, 1);
dim3 blk_dim(32, 32, 1);
double *d_mat, *d_matT, *d_matSym;
cudaMalloc(&d_mat, SIZE*SIZE*sizeof(double));
cudaMalloc(&d_matT, SIZE*SIZE*sizeof(double));
cudaMalloc(&d_matSym, SIZE*SIZE*sizeof(double));
matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym);
cudaDeviceSynchronize();
}
$ nvcc -o t12 t12.cu -arch=sm_70
$ nvprof ./t12
==40364== NVPROF is profiling process 40364, command: ./t12
==40364== Profiling application: ./t12
==40364== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 11.0379s 2 5.51893s 5.50503s 5.53282s matrixMultiply(double*, double*, double*)
API calls: 97.80% 11.0379s 1 11.0379s 11.0379s 11.0379s cudaDeviceSynchronize
2.15% 242.74ms 3 80.914ms 1.9747ms 238.78ms cudaMalloc
0.04% 4.3105ms 4 1.0776ms 1.0728ms 1.0890ms cuDeviceTotalMem
0.01% 1.4584ms 404 3.6090us 110ns 163.21us cuDeviceGetAttribute
0.00% 142.28us 4 35.571us 32.475us 41.980us cuDeviceGetName
0.00% 51.695us 2 25.847us 7.0790us 44.616us cudaLaunchKernel
0.00% 11.198us 4 2.7990us 1.6260us 5.5950us cuDeviceGetPCIBusId
0.00% 1.7930us 3 597ns 149ns 1.2200us cuDeviceGetCount
0.00% 1.6590us 8 207ns 120ns 704ns cuDeviceGet
0.00% 733ns 4 183ns 161ns 215ns cuDeviceGetUuid
$
CUDA 11.2, same V100 machine
Again we see a kernel duration of about 5.5 sec, lining up with your report. Now let's compare the SASS. We use the cuobjdump utility to extract it, and in both cases it is evident that the compiler is unrolling the kernel loop in some fashion. The key operation is DFMA, Double-precision Fused Multiply-Add, which is issued once per loop iteration to compute a single product (and keep a running sum). Therefore, I will extract just a short sequence of DFMA operations in each case, from the midst of the unrolled loop, to get an idea of the per-FLOP "cost" and "overhead".
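(For reference, one way to extract the SASS is to run cuobjdump -sass against each executable and page through the output, e.g.:)
$ cuobjdump -sass ./t11 > t11.sass
$ cuobjdump -sass ./t12 > t12.sass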
CUDA Fortran:
/*1670*/ LDG.E.64.SYS R18, [R12+0x1b8] ; /* 0x0001b8000c127381 */
/* 0x000ea200001eeb00 */
/*1680*/ IADD3 R20, P0, R16, R29, RZ ; /* 0x0000001d10147210 */
/* 0x001fc60007f1e0ff */
/*1690*/ LDG.E.64.SYS R22, [R12+0x1b0] ; /* 0x0001b0000c167381 */
/* 0x000e6400001eeb00 */
/*16a0*/ IMAD.X R21, R17, 0x1, R27, P0 ; /* 0x0000000111157824 */
/* 0x000fe400000e061b */
/*16b0*/ LDG.E.64.SYS R24, [R16] ; /* 0x0000000010187381 */
/* 0x00006c00001eeb00 */
/*16c0*/ LDG.E.64.SYS R16, [R20] ; /* 0x0000000014107381 */
/* 0x001ea200001eeb00 */
/*16d0*/ DFMA R22, R24, R22, R14 ; /* 0x000000161816722b */
/* 0x002084000000000e */
/*16e0*/ IADD3 R14, P0, R20, R29, RZ ; /* 0x0000001d140e7210 */
/* 0x001fca0007f1e0ff */
/*16f0*/ IMAD.X R15, R21, 0x1, R27, P0 ; /* 0x00000001150f7824 */
/* 0x000fe200000e061b */
/*1700*/ DFMA R16, R16, R18, R22 ; /* 0x000000121010722b */
The above sequence repeats within the unrolled loop area. We note that there are 2 DFMA operations and 4 LDG operations (which makes sense: two loads per multiply), and the remaining 4 instructions (the IADD3/IMAD.X pairs, which appear to perform the 64-bit address updates) presumably constitute the indexing work as the loop proceeds through its iterations.
CUDA C++:
/*02d0*/ LDG.E.64.SYS R16, [R6+0x40000] ; /* 0x0400000006107381 */
/* 0x001f2800001eeb00 */
/*02e0*/ LDG.E.64.SYS R24, [R4+0x10] ; /* 0x0000100004187381 */
/* 0x000f2200001eeb00 */
/*02f0*/ DFMA R10, R20, R10, R22 ; /* 0x0000000a140a722b */
/* 0x0200860000000016 */
/*0300*/ LDG.E.64.SYS R22, [R6+0x60000] ; /* 0x0600000006167381 */
/* 0x001f6800001eeb00 */
/*0310*/ LDG.E.64.SYS R20, [R4+0x18] ; /* 0x0000180004147381 */
/* 0x000f6200001eeb00 */
/*0320*/ DFMA R14, R8, R14, R10 ; /* 0x0000000e080e722b */
Again that sequence repeats. We see 2 DFMA and 4 LDG instructions, but no other instructions in the repeating sequence. We note that the LDG instructions carry offsets that are incorporated into the instruction at compile time, and these offsets eliminate the need for any extra instructions, because they simply "increment" a previously computed base address. (We also note that the increments are row-wise strides in one case and column-wise strides in the other: 0x10 and 0x18 are 2 and 3 doubles, while 0x40000 and 0x60000 are 2*SIZE and 3*SIZE doubles, just as we would expect for picking up the two multiplication operands in the kernel loop.)
Can we refactor the Fortran code to make it work like the C++ code? Yes:
$ cat t11a.cuf
module mmk
USE cudafor
!
! Definition of symbols for real types (RP)
!
IMPLICIT NONE
!
INTEGER, PARAMETER :: SP = SELECTED_REAL_KIND(6, 37) ! REAL32
INTEGER, PARAMETER :: DP = SELECTED_REAL_KIND(15, 307) ! REAL64
INTEGER, PARAMETER :: SIZE_ = 16384
!
Contains
attributes(global) subroutine matrixMultiply(d_mat, d_matT, d_matSym)
integer :: tx, ty, k
REAL(DP) :: accum
REAL(DP), dimension(:) :: d_mat, d_matT, d_matSym
tx = threadIdx%x + (blockIdx%x - 1) * blockDim%x
ty = threadIdx%y + (blockIdx%y - 1) * blockDim%y
if (tx <= SIZE_ .and. ty <=SIZE_) then
accum = 0.0
do k=1, SIZE_
accum = accum + d_mat((ty-1)*SIZE_+k) * d_matT((k-1)*SIZE_+tx)
end do
d_matSym((ty-1)*SIZE_+tx) = accum
end if
end subroutine matrixMultiply
end module mmk
PROGRAM Test
!
! This is the main program for Test
!
USE cudafor
USE mmk
!
IMPLICIT NONE
!
REAL(DP), ALLOCATABLE, DEVICE, DIMENSION(:) :: d_mat, d_matT, d_matSym
!
INTEGER :: err, i1, i2
type(dim3) :: grid_dim, blk_dim
!
! Allocate storage for the arrays
!
Allocate(d_mat(SIZE_*SIZE_),d_matT(SIZE_*SIZE_),d_matSym(SIZE_*SIZE_))
!
! invoke the kernel
!
!Call
grid_dim = dim3(SIZE_/32, SIZE_/32, 1)
blk_dim = dim3(32, 32, 1)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
call matrixMultiply<<<grid_dim, blk_dim>>>(d_mat, d_matT, d_matSym)
err = cudaDeviceSynchronize()
!
! Free storage for the arrays
!
Deallocate(d_mat,d_matT,d_matSym)
!
END PROGRAM Test
$ nvfortran t11a.cuf -o t11a
$ nvprof ./t11a
==45544== NVPROF is profiling process 45544, command: ./t11a
==45544== Profiling application: ./t11a
==45544== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 10.8169s 2 5.40847s 5.39118s 5.42576s mmk_matrixmultiply_
0.00% 9.2160us 6 1.5360us 1.3120us 2.1440us [CUDA memcpy HtoD]
API calls: 96.72% 10.8240s 9 1.20266s 12.102us 5.42594s cudaFree
3.22% 360.34ms 9 40.038ms 3.2760us 355.64ms cudaMalloc
0.04% 4.3488ms 4 1.0872ms 1.0598ms 1.1593ms cuDeviceTotalMem
0.02% 2.3633ms 404 5.8490us 111ns 869.22us cuDeviceGetAttribute
0.00% 144.50us 4 36.125us 33.254us 41.317us cuDeviceGetName
0.00% 106.43us 6 17.737us 4.7910us 51.453us cudaMemcpy
0.00% 101.46us 2 50.732us 44.479us 56.985us cudaLaunchKernel
0.00% 20.926us 1 20.926us 20.926us 20.926us cudaDeviceSynchronize
0.00% 7.2850us 4 1.8210us 491ns 4.6210us cuDeviceGetPCIBusId
0.00% 1.7650us 8 220ns 118ns 672ns cuDeviceGet
0.00% 926ns 4 231ns 218ns 264ns cuDeviceGetUuid
0.00% 614ns 3 204ns 130ns 310ns cuDeviceGetCount
$
And we see that the performance is approximately the same as the CUDA C++ version.
Notes:
A. I've omitted the SASS analysis for the final variant; that exercise is left to the reader. However, the unrolled loop portion is a repeating LDG-LDG-DFMA sequence, just as witnessed in the CUDA C++ version.
B. You might have two questions. 1: Why is it this way? I've tried to answer that. 2: Should it be this way / does it need to be this way? That I have not answered; I believe it would require statements about Fortran that I don't wish to make, and it is really a question for the compiler engineers. If you think the 2D realization should match the performance of the 1D realization, I suggest filing a bug.
C. It took a little while of head-scratching before I noticed that you had swapped the x and y thread/block components in the C++ case but not in the Fortran case:
int tx = blockIdx.y * blockDim.y + threadIdx.y;
    ^             ^            ^             ^
(With the opposite storage orders taken into account, this makes the two kernels' memory access patterns effectively mirror images of each other, so it does not account for the timing difference by itself.)
D. One of the reasons why I think this discrepancy is not trivially resolvable at compile time is that the compiler's 2D array handling in CUDA Fortran (whether needed or not) seems to involve the creation of a run-time metadata package which includes the array width, perhaps amongst other information. This metadata package is transmitted to the device separately from the array data payload; this much can be surmised from the profiler output above, or seen more clearly by running with --print-gpu-trace. In your CUDA C++ case, the array width (SIZE) is effectively known at compile time. In the Fortran case it is not (or, if it is, the compiler does not seem to be propagating this knowledge through the kernel code generation). You can see that the "improved" addressing mode above relies on offsets that are known at compile time. Whether or not this "should" be resolvable with some more careful treatment of the Fortran I cannot say. However, I would expect that some sort of contiguous declaration/decoration, by itself, is not sufficient to resolve this discrepancy for the compiler. That declaration alone would not obviate the need for on-the-fly runtime indexing calculations based on the array width.
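(For illustration only: such a decoration, applied to the kernel's assumed-shape dummy arguments, would look like the line below. Per the reasoning above, I would not expect it, by itself, to change the generated indexing code.)
REAL(DP), dimension(:,:), contiguous :: d_mat, d_matT, d_matSym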
Upvotes: 4