Reputation: 37
Let me start off by apologizing for this post. I know there have been several posts asking the same question as I will here, but I've tried the solutions that were given and I'm still not getting correct results for CUDA matrix multiplication.
From examples I've followed, I'm pretty sure my algorithm within the kernel is correct. I don't believe I'm having any trouble passing the 2D arrays to the kernel, and since they're passed by reference, I feel like the 2D solution array should contain the correct answers by the time it is printed on the host, but it doesn't.
Could it be an issue with my dim3 dimGrid(B, B) and dim3 dimThreads(T, T) variables? I'm new to the CUDA framework and am still trying to wrap my head around it. Any suggestions would be greatly appreciated. My code is as follows:
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
__global__ void MatMultiply (int *a, int *b, int *c, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int val = 0;
    for (int e = 0; e < N; ++e) {
        val += a[row*N + e] * b[e*N + col];
    }
    c[row*N+col] = val;
}

int main(void) {
    int N, B, T;
    printf("Input integer for matrix dimension size: ");
    scanf("%d", &N);
    printf("Input number of threads in a block: ");
    scanf("%d", &T);
    printf("Input number of blocks in a grid: ");
    scanf("%d", &B);

    int size = N * N * sizeof(int);
    int *a, *b, *c;
    a = (int*)malloc(size);
    b = (int*)malloc(size);
    c = (int*)malloc(size);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i*N+j] = j + i*N;
            b[i*N+j] = j + i*N;
            c[i*N+j] = j + i*N;
        }
    }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, size, cudaMemcpyHostToDevice);

    dim3 dimGrid(B, B);
    dim3 dimThreads(T, T);
    MatMultiply<<<B, T>>>(dev_a,dev_b,dev_c, N);

    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            printf("%d\t", b[i*N + j]);
        }
        printf("\n");
    }

    free(a);
    free(b);
    free(c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
Thanks again.
Upvotes: 0
Views: 746
Reputation: 109
So, the problem here seems to be in setting up threads and blocks and using threadIdx, blockDim and gridDim.
NOTE: a practical solution to this particular problem is given below, under the Practical solution heading.
threadIdx is, as the name says, the ID of the thread. More precisely, its threadIdx.x and threadIdx.y components run from 0 up to one less than the threads-per-block counts stored in blockDim.x and blockDim.y. For example, a call
someKernel<<<1,32>>>( .... );
would result in threadIdx.x going from 0 to 31, while threadIdx.y would always be 0 (a plain integer launch argument is treated as a one-dimensional configuration).
If you instead define the CUDA-specific dim3 structure, call it threadsPerBlock, and use it as the second launch argument like this:
dim3 threadsPerBlock( 32, 32 );
someKernel<<<1,threadsPerBlock>>>( .... );
then both threadIdx.x and threadIdx.y go from 0 to 31, and across the kernel execution you get every combination of the two.
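If you want to see this for yourself, a tiny standalone test like the following will do (the showIndices kernel is my own throwaway example, not part of your code, and device-side printf needs compute capability 2.0 or newer):

#include <stdio.h>

// each thread prints its own 2D index pair
__global__ void showIndices() {
    printf("thread (%d, %d)\n", threadIdx.x, threadIdx.y);
}

int main(void) {
    dim3 threadsPerBlock( 32, 32 );          // 32 x 32 = 1024 threads in one block
    showIndices<<<1, threadsPerBlock>>>();   // threadIdx.x and threadIdx.y each run 0..31
    cudaDeviceSynchronize();                 // wait so the device-side printf output gets flushed
    return 0;
}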
Note that you are restricted to a certain maximum number of threads per block. This number differs between graphics cards, or more precisely, between the compute capabilities they support; you can look the numbers up in the compute capability table of the CUDA documentation. Compute capability 2.x and later supports a maximum of 1024 threads per block, while earlier versions support 512. Note also that this means a maximum of 32x32 threads per block when launching in 2 dimensions.
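If you don't want to dig through the documentation, you can also just ask the card itself at runtime; a minimal sketch using cudaGetDeviceProperties (the variable names are mine):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // upper bounds on block size for this particular card
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}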
But what if you need more than that? Well son, then you launch more blocks! You can also launch blocks in 1 or 2 dimensions. For example
dim3 threadsPerBlock( 32, 32 );
dim3 blocksPerGrid ( 256, 256 );
someKernel <<<blocksPerGrid,threadsPerBlock>>>( ... );
The size of the grid is stored in the gridDim structure; in this case both gridDim.x and gridDim.y would be 256, making blockIdx.x and blockIdx.y go from 0 to 255.
Practical solution:
Now that we know this, let's take a look at your code. If you, for example, set T to 32 and B to 256, you would effectively get this:
threadIdx.x would go from 0 to 31
threadIdx.y would go from 0 to 0
blockIdx.x would go from 0 to 255
blockIdx.y would go from 0 to 0
blockDim.x would be 32
blockDim.y would be 1
gridDim.x would be 256
gridDim.y would be 1
Now let's see how your variables react to this...
row would go from 0 to 0
col would go from 0 to 8191
So this is presumably not really what you want. You want both row and col to go from 0 to N-1, right? Well, this is how you do it:
int row = threadIdx.x + blockIdx.x * blockDim.x;
int col = threadIdx.y + blockIdx.y * blockDim.y;
Also make sure that you have enough threads to cover the dimensions of the matrix, that is, make sure that threadsPerBlock * blocksPerGrid is at least N in each dimension. This is usually best done this way:
int threads = 32;
dim3 threadsPerBlock ( threads, threads );
int blocks = (N / threads) + 1;
dim3 blocksPerGrid ( blocks, blocks );
"But if I make it greater than N, then I might have some threads that I dont need" - say you - "I don't want them to do work!" And wise you are sir, to say that. You solve this by simple if clause in which you will enclose your calculations, like so:
if ( row < N && col < N )
{
    // your add... err... code here
}
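Putting all of that together, here is a minimal sketch of how your kernel and its launch could look (I'm reusing your variable names and the row/col mapping from the snippet above; your original mapping of row to the y components works just as well, as long as the launch configuration is two-dimensional):

__global__ void MatMultiply (int *a, int *b, int *c, int N) {
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    int col = threadIdx.y + blockIdx.y * blockDim.y;

    if (row < N && col < N) {            // skip the threads that fall outside the matrix
        int val = 0;
        for (int e = 0; e < N; ++e)
            val += a[row*N + e] * b[e*N + col];
        c[row*N + col] = val;
    }
}

// ... and on the host side:
int threads = 32;
dim3 threadsPerBlock ( threads, threads );
int blocks = (N / threads) + 1;          // enough blocks to cover N in each dimension
dim3 blocksPerGrid ( blocks, blocks );
MatMultiply<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_c, N);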
Hope that helps. Enjoy CUDA ;)
Upvotes: 3
Reputation: 7265
You are not using the dimGrid and dimThreads variables in the kernel call. Instead you are just launching a one-dimensional grid of one-dimensional thread blocks.
Apart from that, you are not checking any of the CUDA calls for errors.
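For example, a common pattern is to wrap every runtime call in a check and to look for launch errors right after the kernel call; a minimal sketch (the CHECK macro is my own name for it, not part of the CUDA API):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// abort with a readable message if a CUDA runtime call fails
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// usage around the calls in the question:
//   CHECK(cudaMalloc((void**)&dev_a, size));
//   CHECK(cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice));
//   MatMultiply<<<dimGrid, dimThreads>>>(dev_a, dev_b, dev_c, N);
//   CHECK(cudaGetLastError());         // errors from the launch configuration itself
//   CHECK(cudaDeviceSynchronize());    // errors that occur while the kernel runs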
Upvotes: 2