Cuda matrix multiplication gives wrong answer

Question

Update!

My current code doesn't check for out of bounds memory access. When I run the cuda memcheck, it says memory access is bad even for matrices of just 2 by 2! I'm accessing memory where I shouldn't somehow and that's the problem!

To check for out of bounds memory access, run cuda-memcheck ./(insert executable here)

Shown below is my code for the matrix multiplication itself:

dim3 block(32,32);
dim3 grid( (n+31)/32, (n+31)/32 );
matrixMul<<>>(d_C, d_A, d_B, n, k);

kA and kB are matrices with values in them (they're all 2's to make it easier).

m, n, k are all the same number for my square matrices

kC is the matrix to store the answer.

#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include 

__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{

    int tx = blockIdx.x * 32 + threadIdx.x;
    int ty = blockIdx.y * 32 + threadIdx.y;
    float value = 0;

    for (int i=0;i

Robert Crovella · Accepted Answer

Based on how you are defining the grid of threads, you should add a thread check to the kernel code like this:

#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include 

__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{

    int tx = blockIdx.x * 32 + threadIdx.x;
    int ty = blockIdx.y * 32 + threadIdx.y;

    if ((ty < n) && (tx < n)) { // add this line
      float value = 0;

      for (int i=0;i



Otherwise threads outside the valid array array will corrupt your results. Things work for multiples of 32x32 because there are no invalid threads. In that case you're launching exactly the required number of threads. But in other cases you are launching extra threads.  These extra threads, if allowed to compute an invalid matrix position, will corrupt the results.

Cuda matrix multiplication gives wrong answer

Answers (1)

Related Questions