Reputation: 259
Update!
My current code doesn't check for out of bounds memory access. When I run the cuda memcheck, it says memory access is bad even for matrices of just 2 by 2! I'm accessing memory where I shouldn't somehow and that's the problem!
To check for out of bounds memory access, run cuda-memcheck ./(insert executable here)
Shown below is my code for the matrix multiplication itself:
dim3 block(32,32);
dim3 grid( (n+31)/32, (n+31)/32 );
matrixMul<<<grid,block>>>(d_C, d_A, d_B, n, k);
kA and kB are matrices with values in them (they're all 2's to make it easier).
m, n, k are all the same number for my square matrices
kC is the matrix to store the answer.
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
#include <stdio.h>
__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{
int tx = blockIdx.x * 32 + threadIdx.x;
int ty = blockIdx.y * 32 + threadIdx.y;
float value = 0;
for (int i=0;i<n;i++)
{
float elementA=kA[ty*n+i];
float elementB=kB[i*k+tx];
value += elementA*elementB;
}
kC[ty*n+tx] = value;
}
#endif // #ifndef _MATRIXMUL_KERNEL_H_
Upvotes: 1
Views: 527
Reputation: 152173
Based on how you are defining the grid of threads, you should add a thread check to the kernel code like this:
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_
#include <stdio.h>
__global__ void matrixMul(float *kC, float *kA, float *kB, int n, int k)
{
int tx = blockIdx.x * 32 + threadIdx.x;
int ty = blockIdx.y * 32 + threadIdx.y;
if ((ty < n) && (tx < n)) { // add this line
float value = 0;
for (int i=0;i<n;i++)
{
float elementA=kA[ty*n+i];
float elementB=kB[i*k+tx];
value += elementA*elementB;
}
kC[ty*n+tx] = value;
} // add this line
}
#endif // #ifndef _MATRIXMUL_KERNEL_H_
Otherwise threads outside the valid array array will corrupt your results. Things work for multiples of 32x32 because there are no invalid threads. In that case you're launching exactly the required number of threads. But in other cases you are launching extra threads. These extra threads, if allowed to compute an invalid matrix position, will corrupt the results.
Upvotes: 2