Reputation: 1083
I am timing how long it takes my CUDA program to calculate matrices of a certain size. For example, 10x10, 100x100, 500x500, 1000x1000.
However, the results are not at all what I was expecting: as the matrices get larger, the computation time actually decreases.
For example, here is the average time (from 1000 runs):
10x10: 0.032768 s
100x100: 0.068960 s
500x500: 0.006336 s
1000x1000: 0.018400 s
The time goes down, then up again at 1000x1000. What is going on? Shouldn't the times keep increasing and eventually level off? Why do they bounce around like a roller coaster?
Here is how the actual timing code is being run:
int blocksNeeded=0;
cudaError_t cudaStatus;
blocksNeeded=(size/MAXTHREADS)+1;
int threadsPerBlock = MAXTHREADS/blocksNeeded+1;
cudaEvent_t start, stop;
float elapsedtime;
...
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
addKernel<<<blocksNeeded, size>>>(dev_c, dev_a, dev_b,size);
cudaStatus = cudaDeviceSynchronize();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedtime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
where MAXTHREADS is 1024 and size is the number of elements in the matrix, i.e. a 10x10 matrix has 100 elements, which is the size.
Updated with kernel:
__global__ void addKernel(float *c, float *a, float *b, int size)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < size)
        c[idx] = a[idx] + b[idx];
}
Upvotes: 1
Views: 2033
Reputation: 8154
I've run a test on a recent GPU cluster equipped with an NVIDIA Tesla M2090. Basically I'm performing a vector addition with different sizes. The results are:
Size Kernel time (msec)
===========================
2 0.04
4 0.010912
8 0.012128
16 0.012256
32 0.011296
64 0.01248
128 0.012192
256 0.012576
512 0.012416
1024 0.012736
2048 0.01232
4096 0.011968
8192 0.011264
16384 0.007296
32768 0.007776
65536 0.009728
131072 0.018304
262144 0.031392
524288 0.055168
1048576 0.10352
What you can see is that there is a knee at a vector size of 16384, which resembles your observations. This is not an error but normal behavior: the GPU has to be fully utilized before it shows its full performance. The point of full utilization is reached, in the case of the Tesla M2090, at around 16384 parallel additions.
The way you are measuring kernel performance is perfectly OK. I assume you've taken this from the "Best Practices Guide" for CUDA.
Note: the data shown was generated from a single kernel run per size, i.e. it is not fully representative. For accurate measurements the kernel should be run multiple times on the same problem, and the kernel time taken as the mean over those runs.
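A minimal sketch of such a measurement loop (reusing the kernel and device pointers from the question; the run count and launch configuration are placeholders):

// Sketch: average the kernel time over NRUNS launches.
const int NRUNS = 1000;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

float totalMs = 0.0f;
for (int run = 0; run < NRUNS; ++run) {
    cudaEventRecord(start, 0);
    addKernel<<<blocksNeeded, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);          // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    totalMs += ms;
}
printf("mean kernel time: %f ms\n", totalMs / NRUNS);

cudaEventDestroy(start);
cudaEventDestroy(stop);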
Upvotes: 4
Reputation: 2053
You must call the kernel with
addKernel<<<blocksNeeded, MAXTHREADS>>>(dev_c, dev_a, dev_b,size);
The second parameter on a kernel call is the number of threads to launch in each block, not the total number of threads.
At 100x100 (10000 elements) you are already exceeding the maximum number of threads per block, which is 1024 for compute capability 2.x.
I also just noticed that you calculate some kind of threadsPerBlock, which is wrong anyway, and that you never use it. Choose a number of threads per block, divide the total number of elements to process by it, and add 1 if the remainder is non-zero; that gives you the number of blocks to launch, as sketched below.
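A minimal sketch of that calculation, reusing the variable names from the question:

// Pick a fixed number of threads per block (MAXTHREADS = 1024 in the question).
int threadsPerBlock = MAXTHREADS;
int blocksNeeded = size / threadsPerBlock;
if (size % threadsPerBlock != 0)
    blocksNeeded += 1;   // one extra block for the remaining elements

addKernel<<<blocksNeeded, threadsPerBlock>>>(dev_c, dev_a, dev_b, size);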
Upvotes: 0