Kamil T

Reputation: 2216

Matrix multiplication gives wrong results on CUDA

I've written a small piece of CUDA code to multiply two square matrices. However, it turns out that most of the cells are miscalculated. According to the tutorial I followed, everything should work fine.

__global__ void gpuMM(int *C, int *A, int *B, int N)
{
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    int col = blockIdx.y*blockDim.y + threadIdx.y;
    int sum = 0;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*B[n*N+col];

    C[row*N+col] = sum;
}

#define ROW_SIZE 5
#define MATRIX_LENGTH ROW_SIZE*ROW_SIZE
#define BLOCK_SIZE 16

void MultiplyMatrixCUDA(int * pResult, int* pFactorA, int*pFactorB)
{
    int size = MATRIX_LENGTH*sizeof(int);
    int *dA,*dB,*dC;
    cudaMalloc(&dA,size);
    cudaMalloc(&dB,size);
    cudaMalloc(&dC,size);
    int K = 100;
    dim3 threadBlock(BLOCK_SIZE,BLOCK_SIZE);
    dim3 grid(K,K);

    printf("A:\n");
    DrawMatrix(pFactorA);
    printf("\n");

    printf("B:\n");
    DrawMatrix(pFactorB);
    printf("\n");

    // Copy matrices from the host to device
    cudaMemcpy(dA,pFactorA,size,cudaMemcpyHostToDevice);
    cudaMemcpy(dB,pFactorB,size,cudaMemcpyHostToDevice);

    //Execute the matrix multiplication kernel
    gpuMM<<<grid,threadBlock>>>(dC,dA,dB,ROW_SIZE);


    // Allocate memory to store the GPU answer on the host
    int *C;
    C = new int[MATRIX_LENGTH];

    // Now copy the GPU result back to CPU
    cudaMemcpy(C,dC,size,cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    printf("\nC from CUDA:\n");
    DrawMatrix(C);
    printf("\nC:\n");
    DrawMatrix(MultiplyWithCPU(pResult, pFactorA, pFactorB));  // the code of the CPU multiplication function is irrelevant here; I'm sure it works fine (double-checked)

}

The output shows that the matrix computed with the standard CPU method is correct, but the CUDA result is wrong:

(screenshots of the CPU and CUDA results)

The first row is always correct, but all the other cells look completely random. Sometimes they are negative, sometimes not; sometimes they are close to the real values, sometimes they are completely different.

What is my mistake? I don't see where it fails. The algorithm looks fine and the variables seem to be passed correctly, yet something doesn't work.

--- EDIT

All the variables (pResult and both pFactor matrices) are initialized (and later deleted) in other parts of the code.

Upvotes: 1

Views: 1106

Answers (1)

srodrb

Reputation: 1344

Since the number of threads per block is not equal to the number of elements in the output matrix (you're mapping a 5x5 matrix onto a 16x16 block), some threads are reading from and writing to invalid memory positions.

The solution is to include a boundary check on both indices. This will leave some threads idle. The kernel should look like this:

__global__ void gpuMM(int *C, int *A, int *B, int N)
{
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    int col = blockIdx.y*blockDim.y + threadIdx.y;

    if( (row < N) && (col < N))
    {
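        // Only threads that map onto a valid element of the N x N matrix compute a result; the rest stay idle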
        int sum = 0;
        for (int n = 0; n < N; ++n){
            sum += A[row*N+n]*B[n*N+col];       
        }
        C[row*N+col] = sum;
    }
}
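
On the host side, the launch configuration can then be sized to just cover the matrix instead of a fixed 100x100 grid. A minimal sketch, reusing ROW_SIZE, BLOCK_SIZE, gpuMM and the device pointers from the question:

    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    // Round the grid up so the whole ROW_SIZE x ROW_SIZE output is covered;
    // the boundary check in the kernel discards the surplus threads.
    dim3 grid((ROW_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE,
              (ROW_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE);
    gpuMM<<<grid, threadBlock>>>(dC, dA, dB, ROW_SIZE);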

Another solution, which may be more efficient depending on your device, is to launch fewer threads per block (25 in this case).
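
A minimal sketch of that alternative, assuming the same gpuMM kernel and the names from the question's MultiplyMatrixCUDA:

    // One 5x5 block covers the whole output, so every thread maps to a valid element.
    dim3 threadBlock(ROW_SIZE, ROW_SIZE);   // 25 threads, one per output cell
    dim3 grid(1, 1);                        // a single block is enough for 5x5
    gpuMM<<<grid, threadBlock>>>(dC, dA, dB, ROW_SIZE);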

Upvotes: 4
