Mechy
Mechy

Reputation: 259

Cuda matrix multiplication gives wrong answer

Update!

My current code doesn't check for out of bounds memory access. When I run the cuda memcheck, it says memory access is bad even for matrices of just 2 by 2! I'm accessing memory where I shouldn't somehow and that's the problem!

To check for out of bounds memory access, run cuda-memcheck ./(insert executable here)

Shown below is my code for the matrix multiplication itself:

dim3 block(32,32);
dim3 grid( (n+31)/32, (n+31)/32 );
matrixMul<<<grid,block>>>(d_C, d_A, d_B, n, k);

kA and kB are matrices with values in them (they're all 2's to make it easier).

m, n, k are all the same number for my square matrices

kC is the matrix to store the answer.

#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{

    int tx = blockIdx.x * 32 + threadIdx.x;
    int ty = blockIdx.y * 32 + threadIdx.y;
    float value = 0;

    for (int i=0;i<n;i++)
    {
        float elementA=kA[ty*n+i];
        float elementB=kB[i*k+tx];
        value += elementA*elementB;
    }

    kC[ty*n+tx] = value;
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_

Upvotes: 1

Views: 527

Answers (1)

Robert Crovella
Robert Crovella

Reputation: 152173

Based on how you are defining the grid of threads, you should add a thread check to the kernel code like this:

#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{

    int tx = blockIdx.x * 32 + threadIdx.x;
    int ty = blockIdx.y * 32 + threadIdx.y;

    if ((ty < n) && (tx < n)) { // add this line
      float value = 0;

      for (int i=0;i<n;i++)
      {
        float elementA=kA[ty*n+i];
        float elementB=kB[i*k+tx];
        value += elementA*elementB;
      }

      kC[ty*n+tx] = value;
    }  //  add this line
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_

Otherwise threads outside the valid array array will corrupt your results. Things work for multiples of 32x32 because there are no invalid threads. In that case you're launching exactly the required number of threads. But in other cases you are launching extra threads. These extra threads, if allowed to compute an invalid matrix position, will corrupt the results.

Upvotes: 2

Related Questions