Reputation: 15481
EDITED to correspond with current state after linked question.
I am trying to reimplement basic matrix multiplication in CUDA. My code works fine for square matrices and for rectangular matrices whose dimensions are multiples of 8, but it does not work for rectangular matrices whose dimensions are not multiples of 8.
The following is my kernel function:
__global__ void matrixMultiply(float * A, float * B, float * C,
                               int numARows, int numAColumns,
                               int numBRows, int numBColumns,
                               int numCRows, int numCColumns) {
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (numAColumns != numBRows) return;
    if ((Row < numARows) && (Col < numBColumns)) {
        float Cvalue = 0;
        for (int k = 0; k < numAColumns; ++k)
            Cvalue += A[Row * numAColumns + k] * B[k * numBColumns + Col];
        C[Row * numCColumns + Col] = Cvalue;
    }
}
The following is the memory allocation and the host-to-device copies (for readability I have cut out the error checking):
cudaMalloc((void**) &deviceA, ARows*sizeof(float)*AColumns);
cudaMalloc((void**) &deviceB, BRows*sizeof(float)*BColumns);
cudaMalloc((void**) &deviceC, CRows*sizeof(float)*CColumns);
cudaMemcpy(deviceA, hostA, ARows*sizeof(float)*AColumns, cudaMemcpyHostToDevice);
cudaMemcpy(deviceB, hostB, BRows*sizeof(float)*BColumns, cudaMemcpyHostToDevice);
cudaMemcpy(deviceC, hostC, CRows*sizeof(float)*CColumns, cudaMemcpyHostToDevice);
The following is the kernel launch:
dim3 dimGrid((int)ceil(numCRows / 8.0) , (int)ceil(numCColumns / 8.0), 1);
dim3 dimBlock(8 , 8, 1);
matrixMultiply<<<dimGrid,dimBlock>>>(deviceA, deviceB, deviceC, ARows, AColumns, BRows, BColumns, CRows, CColumns);
And finally, copying the result back to the host:
cudaMemcpy(hostC, deviceC, CRows*sizeof(float)*CColumns, cudaMemcpyDeviceToHost);
I have traced my algorithm repeatedly and do not believe there is anything wrong with it, so I suspect the problem is in the block/grid sizing scheme I've used. If anybody who knows CUDA/C better than I do (Ruby/JavaScript guy here) could take a look and walk me through what I am doing wrong, I would be very grateful.
Upvotes: 4
Views: 2632
Reputation: 16816
The problem is with the grid size you are creating:
dim3 dimGrid((int)ceil(numCRows / 8.0) , (int)ceil(numCColumns / 8.0), 1);
Rows correspond to the Y dimension of the matrix and columns to the X dimension (the kernel computes Row from blockIdx.y and Col from blockIdx.x), so you are actually creating a transposed grid.
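To make the effect concrete, here is a small host-side sketch (plain C++, no GPU needed) that checks whether a launch grid reaches every element of C. The helper names `coversMatrix`, `transposedGridCovers` and `correctedGridCovers`, and the 10 x 20 example size, are assumptions chosen for illustration; they are not part of the original code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true if a gridX x gridY grid of (block x block)-thread
// blocks reaches every element of a rows x cols matrix, given that the kernel
// derives Col from the x index and Row from the y index.
bool coversMatrix(int gridX, int gridY, int block, int rows, int cols) {
    assert(block > 0);
    return gridX * block >= cols && gridY * block >= rows;
}

// Grid as written in the question: x is sized from rows, y from columns.
bool transposedGridCovers(int rows, int cols) {
    int gridX = (int)std::ceil(rows / 8.0);
    int gridY = (int)std::ceil(cols / 8.0);
    return coversMatrix(gridX, gridY, 8, rows, cols);
}

// Corrected grid: x is sized from columns, y from rows.
bool correctedGridCovers(int rows, int cols) {
    int gridX = (int)std::ceil(cols / 8.0);
    int gridY = (int)std::ceil(rows / 8.0);
    return coversMatrix(gridX, gridY, 8, rows, cols);
}
```

For a 10 x 20 result matrix, the transposed grid is 2 x 3 blocks, which gives 16 threads in x and 24 in y, so columns 16 to 19 never get a thread. For a square matrix both versions produce the same grid, which is consistent with the square case working.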
To create the correct grid, do the following:
dim3 dimGrid((int)ceil(numCColumns / 8.0) , (int)ceil(numCRows / 8.0), 1);
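A better approach is to compute the grid size with integer arithmetic from the block dimensions, which avoids the floating-point ceil and the hard-coded 8 (dimBlock must be declared before this):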
dim3 dimGrid;
dimGrid.x = (numCColumns + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (numCRows + dimBlock.y - 1)/dimBlock.y;
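The integer expression used here is the usual ceiling-division idiom. A minimal host-side sketch follows, assuming a hypothetical helper named `ceilDiv` and a hypothetical `GridDims` struct that stands in for CUDA's `dim3` so the snippet builds with a plain C++ compiler:

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks of size `block`
// needed to cover n items, using integer arithmetic only.
int ceilDiv(int n, int block) {
    assert(n >= 0 && block > 0);
    return (n + block - 1) / block;
}

// Stand-in for dim3 (an assumption, so this compiles without the CUDA toolkit).
struct GridDims {
    int x;
    int y;
};

// Size the grid so that x covers the columns of C and y covers the rows.
GridDims gridFor(int numCRows, int numCColumns, int blockX, int blockY) {
    GridDims g;
    g.x = ceilDiv(numCColumns, blockX);
    g.y = ceilDiv(numCRows, blockY);
    return g;
}
```

With numCRows = 10, numCColumns = 20 and 8 x 8 blocks, this yields a grid of 3 blocks in x and 2 in y. In the actual launch the two values would go into `dim3 dimGrid(g.x, g.y, 1)`, and changing the block size only requires changing dimBlock.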
Upvotes: 3