Reputation: 29
I have a problem making a matrix multiplication using CUDA. I have to compute A*A*A*A and save the result in hB. With cuBLAS it works fine, but I can't get it to work with plain CUDA. The dimension can be a large value like 2000. This is my code:
__global__ void CudaMM(float *A, float *B, int N)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*A[n*N+col];   // B = A*A
    B[row*N+col] = sum;
}
void CudaMult(int dimension, float *hMatrice, float *hB, float *d_A, float *d_B)
{
    int N, K;
    K = 100;
    N = K*BLOCK_SIZE;
    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(K, K);
    cudaMemcpy(d_A, hMatrice, dimension*dimension*sizeof(float), cudaMemcpyHostToDevice);
    CudaMM<<<grid, threadBlock>>>(d_A, d_B, N);
    cudaMemcpy(hB, d_B, dimension*dimension*sizeof(float), cudaMemcpyDeviceToHost);
}
void CublasFindConnect(int dimension, float *mat, float *B)
{
    float *d_A, *d_B;
    cudaMalloc(&d_A, dimension*dimension*sizeof(float));
    cudaMalloc(&d_B, dimension*dimension*sizeof(float));
    int w = 0;
    while (w < 5) {
        CudaMult(dimension, mat, B, d_A, d_B);
        // Copy the computed matrix B back into mat for the next iteration
        for (int m = 0; m < dimension; m++) {
            for (int n = 0; n < dimension; n++) {
                mat[m*dimension+n] = B[m*dimension+n];
                B[m*dimension+n] = 0;
            }
        }
        w++;
    }
    cudaFree(d_A);
    cudaFree(d_B);
}
I've installed the latest CUDA 6, which as I understand it doesn't require cudaMemcpy, because memory is shared between host and device (Unified Memory).
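What I mean is something like this (just a sketch of the Unified Memory style, not code I've actually run):
float *A;
cudaMallocManaged(&A, dimension*dimension*sizeof(float));  // one pointer, visible to host and device
// ... fill A on the host, launch kernels on A directly, no cudaMemcpy ...
cudaDeviceSynchronize();  // wait for the GPU before reading results on the host
cudaFree(A);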
Upvotes: 0
Views: 1635
Reputation: 152173
CUDA 6 Unified Memory has specific requirements (such as allocating with cudaMallocManaged()) that you're not meeting, but in any event your code does not depend on Unified Memory, so it's irrelevant here.
One problem I can see in your code is that your dimension variable is arbitrary (you say it can be up to a large value like 2000), but your computation size is fixed at N=K*BLOCK_SIZE;. Presumably, if your BLOCK_SIZE is some value like 16 or 32, it will meet your approximate maximum dimension size of ~2000.
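To make the mismatch concrete (assuming BLOCK_SIZE is 32, which your excerpt doesn't show):
// assumed: #define BLOCK_SIZE 32   (not shown in the question)
// N = K * BLOCK_SIZE = 100 * 32 = 3200
// so the kernel indexes a 3200x3200 matrix, even though only
// dimension x dimension (e.g. 2000x2000) elements were allocated.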
The problem arises because your grid size is potentially larger than your valid array size. You are launching an N x N grid, but N can be larger than dimension. This means some of the launched threads can attempt to access the matrices (A and B) outside of their valid dimensions.
You can fix this with a "thread check" in your kernel, something like this:
__global__ void CudaMM(float *A, float *B, int N)
{
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    if ((row < N) && (col < N)) {   // skip threads outside the valid matrix
        float sum = 0.f;
        for (int n = 0; n < N; ++n)
            sum += A[row*N+n]*A[n*N+col];
        B[row*N+col] = sum;
    }
}
and you will need to modify your kernel invocation to:
CudaMM<<<grid,threadBlock>>>(d_A,d_B,dimension);
You might also want to consider choosing grid sizes based on your actual dimension, rather than fixed at 100*BLOCK_SIZE, but that is not essential to get the code to work.
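For example, a launch configuration sized from dimension might look like this (an untested sketch):
void CudaMult(int dimension, float *hMatrice, float *hB, float *d_A, float *d_B)
{
    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    int K = (dimension + BLOCK_SIZE - 1) / BLOCK_SIZE;  // round up so the grid just covers dimension
    dim3 grid(K, K);
    cudaMemcpy(d_A, hMatrice, dimension*dimension*sizeof(float), cudaMemcpyHostToDevice);
    CudaMM<<<grid, threadBlock>>>(d_A, d_B, dimension);
    cudaMemcpy(hB, d_B, dimension*dimension*sizeof(float), cudaMemcpyDeviceToHost);
}
Combined with the thread check above, this launches at most one extra partial block per axis instead of a fixed 100x100 block grid.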
Upvotes: 1