alukard990

Reputation: 851

CUDA and thread blocks overhead

I've implemented a simple kernel that calculates a matrix multiplication. This is the code:

__global__ void MatMultKernel(Mat A, Mat B, Mat C){
    int i;
    double val=0.0;
    // global column (ix) and row (iy) of the C element computed by this thread
    int ix=blockDim.x*blockIdx.x+threadIdx.x;
    int iy=blockDim.y*blockIdx.y+threadIdx.y;
    if(ix<A.nrows && iy<B.nrows){
        // dot product of the iy-th row of A with the ix-th column of B
        for(i=0;i<A.nrows;i++)
            val+=A.value[iy*A.nrows+i]*B.value[i*B.nrows+ix];
        C.value[iy*C.nrows+ix]=val;
    }
}

I measured the running time of this kernel while varying the thread and block configuration.

I've noticed that execution times are always worse when threads are allocated in column-shaped blocks (for example dim3(1,256,1)). What is the reason?
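
For reference, a timing comparison along these lines can be set up roughly as follows. This is only a sketch: the Mat definition, the CUDA-event timing, and the matrix size N are assumptions, not the exact code used for the measurements; only the block shape differs between the two runs.

#include <cuda_runtime.h>
#include <cstdio>

// Assumed layout of the Mat struct (not shown above): row-major storage
// with a device pointer in `value`.
struct Mat { int nrows; int ncols; double *value; };

// Launch MatMultKernel with the given block shape on N x N matrices and
// return the elapsed time in milliseconds, measured with CUDA events.
float timeLaunch(Mat A, Mat B, Mat C, dim3 block, int N)
{
    dim3 grid((N + block.x - 1) / block.x,
              (N + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    MatMultKernel<<<grid, block>>>(A, B, C);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Usage, comparing the "row" and "column" block shapes on the same data:
//   printf("block (256,1,1): %f ms\n", timeLaunch(A, B, C, dim3(256,1,1), N));
//   printf("block (1,256,1): %f ms\n", timeLaunch(A, B, C, dim3(1,256,1), N));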

Upvotes: 1

Views: 460

Answers (1)

Jakub Klinkovský

Reputation: 1362

First, let me point out that your code works only for square matrices; in general you should use A.ncols instead of A.nrows in all three places in your code.
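
For concreteness, the generalized indexing could look roughly like this (a sketch only, assuming row-major storage and that A.ncols equals B.nrows):

__global__ void MatMultKernelGeneral(Mat A, Mat B, Mat C){
    double val=0.0;
    int ix=blockDim.x*blockIdx.x+threadIdx.x;   // column of C
    int iy=blockDim.y*blockIdx.y+threadIdx.y;   // row of C
    if(ix<B.ncols && iy<A.nrows){
        // the row stride of a row-major matrix is its number of columns
        for(int i=0;i<A.ncols;i++)
            val+=A.value[iy*A.ncols+i]*B.value[i*B.ncols+ix];
        C.value[iy*C.ncols+ix]=val;
    }
}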

The performance difference is due to memory access. You have all three matrices stored in row-major format and the code does the following: each thread accesses the iy-th row of A and ix-th column of B and computes their dot product. Note that all threads in a warp always execute the same instruction at the same time, so in the serial loop in your code i is always the same for all threads in a warp. In your code, the block shape matters, because:

  • If the block size is (256, 1, 1), then each thread in a block has the same iy, but different ix. Let's take a look at the access pattern to B: all threads in the same warp always access the same row of B, because in B.value[i*B.nrows+ix] the i is the same and ix differs, so the loads can be coalesced.
  • If the block size is (1, 256, 1), then the situation is transposed, so you'd probably expect the loads from A to be coalesced. But that is not the case, because iy determines the row and the values accessed by two neighboring threads are offset by A.ncols elements, i.e. a full row, so the warp issues strided loads that cannot be coalesced.

The access pattern to C is the same as for B, but it matters much less, because each thread writes C only once while it reads A and B on every loop iteration. With a 2D block the situation is somewhere between the two 1D cases.
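
To make the two cases concrete, here is a tiny host-side illustration (my own addition, assuming square N x N matrices) of the element offsets that four neighboring threads of a warp touch at a fixed loop iteration i:

#include <cstdio>

int main()
{
    const int N = 1024, i = 0;

    printf("block (256,1,1), iy = 0:\n");
    for (int ix = 0; ix < 4; ix++)   // neighboring threads differ in ix
        printf("  A[%d]  B[%d]\n", 0*N + i, i*N + ix);   // B offsets are consecutive

    printf("block (1,256,1), ix = 0:\n");
    for (int iy = 0; iy < 4; iy++)   // neighboring threads differ in iy
        printf("  A[%d]  B[%d]\n", iy*N + i, i*N + 0);   // A offsets jump by N

    return 0;
}

In the first case the B offsets are consecutive, so one coalesced transaction can serve the whole warp; in the second case the A offsets jump by N elements, so each thread's load hits a different memory segment.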

If you want to optimize your code further, you can use shared memory as shown in the CUDA Programming Guide.
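
A minimal sketch of that tiled shared-memory approach (along the lines of the Programming Guide example; it assumes square matrices whose size is a multiple of TILE and the same row-major Mat layout as above):

#define TILE 16

__global__ void MatMultTiled(Mat A, Mat B, Mat C)
{
    __shared__ double As[TILE][TILE];
    __shared__ double Bs[TILE][TILE];

    int ix = blockIdx.x * TILE + threadIdx.x;   // column of C
    int iy = blockIdx.y * TILE + threadIdx.y;   // row of C
    double val = 0.0;

    // march over the tiles of A's row and B's column
    for (int t = 0; t < A.ncols / TILE; t++) {
        // each thread loads one element of the A tile and one of the B tile
        As[threadIdx.y][threadIdx.x] = A.value[iy * A.ncols + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B.value[(t * TILE + threadIdx.y) * B.ncols + ix];
        __syncthreads();

        // partial dot product, read entirely from shared memory
        for (int i = 0; i < TILE; i++)
            val += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        __syncthreads();
    }
    C.value[iy * C.ncols + ix] = val;
}

Both tile loads are coalesced regardless of which dot product a thread is working on, because threadIdx.x indexes the fastest-varying dimension in both of them. The kernel would be launched with dim3(TILE, TILE) blocks and an (N/TILE, N/TILE) grid.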

Upvotes: 3
