Reputation: 2216
I've written a small piece of code using CUDA to multiply two square matrices. However, it turns out that most of the cells are miscalculated. According to the tutorial I've used, everything should work fine.
__global__ void gpuMM(int *C, int *A, int *B, int N)
{
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    int col = blockIdx.y*blockDim.y + threadIdx.y;

    int sum = 0;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*B[n*N+col];
    C[row*N+col] = sum;
}
#define ROW_SIZE 5
#define MATRIX_LENGTH ROW_SIZE*ROW_SIZE
#define BLOCK_SIZE 16
void MultiplyMatrixCUDA(int *pResult, int *pFactorA, int *pFactorB)
{
    int size = MATRIX_LENGTH*sizeof(int);
    int *dA, *dB, *dC;
    cudaMalloc(&dA, size);
    cudaMalloc(&dB, size);
    cudaMalloc(&dC, size);

    int K = 100;
    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(K, K);

    printf("A:\n");
    DrawMatrix(pFactorA);
    printf("\n");
    printf("B:\n");
    DrawMatrix(pFactorB);
    printf("\n");

    // Copy matrices from the host to the device
    cudaMemcpy(dA, pFactorA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, pFactorB, size, cudaMemcpyHostToDevice);

    // Execute the matrix multiplication kernel
    gpuMM<<<grid, threadBlock>>>(dC, dA, dB, ROW_SIZE);

    // Allocate memory to store the GPU answer on the host
    int *C;
    C = new int[MATRIX_LENGTH];

    // Now copy the GPU result back to the CPU
    cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    printf("\nC from CUDA:\n");
    DrawMatrix(C);
    printf("\nC:\n");
    DrawMatrix(MultiplyWithCPU(pResult, pFactorA, pFactorB)); // the code of the multiplying function is irrelevant, I'm sure it works fine (double-checked)
}
The result shows that the matrix multiplied with the standard CPU method is correct, but the CUDA one is wrong:
The first row is always correct, but all the other cells are completely random. Sometimes the values are negative, sometimes not. Sometimes they are close to the real values, sometimes they are COMPLETELY different.
What is my mistake? I can't see where it fails. The algorithm looks fine, the variables seem to be passed correctly, yet something doesn't work.
--- EDIT
All the variables (pResult and both pFactors) are initialized (and later deleted) in other parts of the code.
Upvotes: 1
Views: 1106
Reputation: 1344
Since the number of threads per block is not equal to the number of elements in the output matrix (you're mapping a 5x5 matrix onto a 16x16 block), some threads are accessing/writing invalid memory positions.
The solution is to include a double boundary check, which will leave some threads idle. The kernel should look like this:
__global__ void gpuMM(int *C, int *A, int *B, int N)
{
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    int col = blockIdx.y*blockDim.y + threadIdx.y;

    // Only threads that map to a valid element of the N x N matrix do any work
    if ((row < N) && (col < N))
    {
        int sum = 0;
        for (int n = 0; n < N; ++n) {
            sum += A[row*N+n]*B[n*N+col];
        }
        C[row*N+col] = sum;
    }
}
Another solution, which may be more effective depending on your device, is launching fewer threads per block (25 in this case).
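A minimal sketch of that launch, reusing the ROW_SIZE macro and the device pointers from the question's host code (and assuming the matrix stays 5x5), could look like this:

    dim3 threadBlock(ROW_SIZE, ROW_SIZE); // 25 threads: one per output element
    dim3 grid(1, 1);                      // a single block covers the whole 5x5 matrix
    gpuMM<<<grid, threadBlock>>>(dC, dA, dB, ROW_SIZE);

With this one-to-one mapping between threads and output elements no thread indexes outside the matrix, although the boundary check above remains the safer general-purpose pattern.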
Upvotes: 4