Reputation: 37
Let me start off by apologizing for this post. I know there have been several posts asking the same question as I will here, but I've tried the solutions that were given and I'm still not getting correct results for CUDA matrix multiplication.
From examples I've followed, I'm pretty sure my algorithm within the kernel is correct. I don't believe I'm having any trouble passing the 2D arrays to the kernel, and since they're passed by reference, I feel like the 2D solution array should contain the correct answers by the time it is printed on the host, but it doesn't.
Could it be an issue with my dim3 dimGrid(B, B) and dim3 dimThreads(T, T) variables? I'm new to the CUDA framework and am still trying to wrap my head around it. Any suggestions would be greatly appreciated. My code is as follows:
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
__global__ void MatMultiply (int *a, int *b, int *c, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int val = 0;
    for (int e = 0; e < N; ++e) {
        val += a[row*N + e] * b[e*N + col];
    }
    c[row*N+col] = val;
}

int main(void) {
    int N, B, T;
    printf("Input integer for matrix dimension size: ");
    scanf("%d", &N);
    printf("Input number of threads in a block: ");
    scanf("%d", &T);
    printf("Input number of blocks in a grid: ");
    scanf("%d", &B);

    int size = N * N * sizeof(int);
    int *a, *b, *c;
    a = (int*)malloc(size);
    b = (int*)malloc(size);
    c = (int*)malloc(size);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i*N+j] = j + i*N;
            b[i*N+j] = j + i*N;
            c[i*N+j] = j + i*N;
        }
    }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);

    dim3 dimGrid(B, B);
    dim3 dimThreads(T, T);
    MatMultiply<<<B, T>>>(dev_a,dev_b,dev_c, N);

    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%d\t", b[i*N + j]);
        }
        printf("\n");
    }

    free(a);
    free(b);
    free(c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
Thanks again.
Upvotes: 0
Views: 746
Reputation: 109
So, the problem here seems to be in setting up threads and blocks and using threadIdx, blockDim and gridDim.
NOTE: a practical solution to this particular problem is given below, under the Practical solution heading.
threadIdx is, as the name says, the ID of the thread. More precisely, its threadIdx.x and threadIdx.y components run from 0 up to one less than the threads-per-block counts stored in blockDim.x and blockDim.y. For example, a call
someKernel<<<1,32>>>( .... );
would result in threadIdx.x going from 0 to 31, while threadIdx.y would always be 0 (a plain integer launch argument is treated as a one-dimensional configuration).
If you instead define the CUDA-specific dim3 structure, call it threadsPerBlock, and use it as the second launch argument like this:
dim3 threadsPerBlock( 32, 32 );
someKernel<<<1,threadsPerBlock>>>( .... );
then both threadIdx.x and threadIdx.y go from 0 to 31, and across the kernel execution you get every combination of the two.
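If you want to see this for yourself, a tiny standalone test like the following will do (the showIndices kernel is my own throwaway example, not part of your code, and device-side printf needs compute capability 2.0 or newer):

#include <stdio.h>

// each thread prints its own 2D index pair
__global__ void showIndices() {
    printf("thread (%d, %d)\n", threadIdx.x, threadIdx.y);
}

int main(void) {
    dim3 threadsPerBlock( 32, 32 );          // 32 x 32 = 1024 threads in one block
    showIndices<<<1, threadsPerBlock>>>();   // threadIdx.x and threadIdx.y each run 0..31
    cudaDeviceSynchronize();                 // wait so the device-side printf output gets flushed
    return 0;
}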
Note that you are restricted to a certain maximum number of threads per block. This number differs between graphics cards, or more precisely, between the compute capabilities they support; you can look the numbers up in the compute capability table of the CUDA documentation. Compute capability 2.x and later supports a maximum of 1024 threads per block, while earlier versions support 512. Note also that this means a maximum of 32x32 threads per block when launching in 2 dimensions.
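If you don't want to dig through the documentation, you can also just ask the card itself at runtime; a minimal sketch using cudaGetDeviceProperties (the variable names are mine):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // upper bounds on block size for this particular card
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}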
But what if you need more than that? Well son, then you launch more blocks! You can also launch blocks in 1 or 2 dimensions. For example
dim3 threadsPerBlock( 32, 32 );
dim3 blocksPerGrid ( 256, 256 );
someKernel <<<blocksPerGrid,threadsPerBlock>>>( ... );
The size of the grid is stored in the gridDim structure; in this case both gridDim.x and gridDim.y would be 256, making blockIdx.x and blockIdx.y go from 0 to 255.
Practical solution:
Now that we know this, let's take a look at your code. If you, for example, set T to 32 and B to 256, you would effectively get this:
threadIdx.x would go from 0 to 31
threadIdx.y would go from 0 to 0
blockIdx.x would go from 0 to 255
blockIdx.y would go from 0 to 0
blockDim.x would be 32
blockDim.y would be 1
gridDim.x would be 256
gridDim.y would be 1
Now let's see how your variables react to this...
row would go from 0 to 0
col would go from 0 to 8191
So this is presumably not really what you want. You want both row and col to go from 0 to N-1, right? Well, this is how you do it:
int row = threadIdx.x + blockIdx.x * blockDim.x;
int col = threadIdx.y + blockIdx.y * blockDim.y;
Also make sure that you have enough threads to cover the dimensions of the matrix, that is, make sure that threadsPerBlock * blocksPerGrid is at least N in each dimension. This is usually best done this way:
int threads = 32;
dim3 threadsPerBlock ( threads, threads );
int blocks = (N / threads) + 1;
dim3 blocksPerGrid ( blocks, blocks );
"But if I make it greater than N, then I might have some threads that I dont need" - say you - "I don't want them to do work!" And wise you are sir, to say that. You solve this by simple if clause in which you will enclose your calculations, like so:
if ( row < N && col < N )
{
    // your add... err... code here
}
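Putting all of that together, here is a minimal sketch of how your kernel and its launch could look (I'm reusing your variable names and the row/col mapping from the snippet above; your original mapping of row to the y components works just as well, as long as the launch configuration is two-dimensional):

__global__ void MatMultiply (int *a, int *b, int *c, int N) {
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    int col = threadIdx.y + blockIdx.y * blockDim.y;

    if (row < N && col < N) {            // skip the threads that fall outside the matrix
        int val = 0;
        for (int e = 0; e < N; ++e)
            val += a[row*N + e] * b[e*N + col];
        c[row*N + col] = val;
    }
}

// ... and on the host side:
int threads = 32;
dim3 threadsPerBlock ( threads, threads );
int blocks = (N / threads) + 1;          // enough blocks to cover N in each dimension
dim3 blocksPerGrid ( blocks, blocks );
MatMultiply<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);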
Hope that helps. Enjoy CUDA ;)
Upvotes: 3
Reputation: 7265
You are not using the dimGrid and dimThreads variables in the kernel call. Instead you are just launching a one-dimensional grid of one-dimensional thread blocks.
Apart from that, you are not checking any of the CUDA calls for errors.
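For example, a common pattern is to wrap every runtime call in a check and to look for launch errors right after the kernel call; a minimal sketch (the CHECK macro is my own name for it, not part of the CUDA API):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// abort with a readable message if a CUDA runtime call fails
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// usage around the calls in the question:
//   CHECK(cudaMalloc((void**)&dev_a, size));
//   CHECK(cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice));
//   MatMultiply<<<dimGrid, dimThreads>>>(dev_a, dev_b, dev_c, N);
//   CHECK(cudaGetLastError());         // errors from the launch configuration itself
//   CHECK(cudaDeviceSynchronize());    // errors that occur while the kernel runs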
Upvotes: 2