Reputation: 1023
I want to find out how the number of threads in a block affects the performance and speed of a CUDA program. I wrote a simple vector addition code; here is my code:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// Each thread adds one pair of elements.
__global__ void gpuVecAdd(float *a, float *b, float *c, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        c[id] = a[id] + b[id];
    }
}

int main() {
    int n = 1000000;
    float *h_a, *h_b, *h_c;
    srand(time(NULL));
    size_t bytes = n * sizeof(float);
    h_a = (float*) malloc(bytes);
    h_b = (float*) malloc(bytes);
    h_c = (float*) malloc(bytes);
    for (int i = 0; i < n; i++)
    {
        h_a[i] = rand() % 10;
        h_b[i] = rand() % 10;
    }
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    gpuErrchk( cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) );
    gpuErrchk( cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice) );
    clock_t t1, t2;
    t1 = clock();
    int block_size = 1024;
    // Round the grid size up so that every element is covered.
    int grid_size = (n + block_size - 1) / block_size;
    gpuVecAdd<<<grid_size, block_size>>>(d_a, d_b, d_c, n);
    gpuErrchk( cudaPeekAtLastError() );
    t2 = clock();
    cout << (float)(t2 - t1) / CLOCKS_PER_SEC << " seconds";
    gpuErrchk( cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost) );
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
}
I read this post, and based on talonmies' answer, "The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware."
I tested the code with different numbers of threads per block, for example 2 and 1024 (the latter being a multiple of 32 and also the maximum number of threads per block). The average running time for both sizes is almost equal, and I don't see a huge difference between them. Why is that? Is my benchmarking incorrect?
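The only thing I change between runs is the block size in the launch configuration:

int block_size = 2;  // or 1024 for the second run
int grid_size = (n + block_size - 1) / block_size;
gpuVecAdd<<<grid_size, block_size>>>(d_a, d_b, d_c, n);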
Upvotes: 0
Views: 80
Reputation: 151799
GPU kernel launches in CUDA are asynchronous. This means that control is returned to the CPU thread before the kernel has finished executing, so the code in the question is timing only the kernel launch overhead, which is essentially independent of the block size.
If we want the CPU thread to time the duration of the kernel, we must make the CPU thread wait until the kernel has finished. We can do this by putting a call to cudaDeviceSynchronize() in the timing region. The measured time will then include the full duration of kernel execution.
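Applied to the code in the question, the timing region would look something like this (a minimal sketch, reusing the question's gpuErrchk macro):

t1 = clock();
int block_size = 1024;
int grid_size = (n + block_size - 1) / block_size;
gpuVecAdd<<<grid_size, block_size>>>(d_a, d_b, d_c, n);
gpuErrchk( cudaPeekAtLastError() );
// Wait for the kernel to finish so the measurement covers its
// full execution, not just the launch.
gpuErrchk( cudaDeviceSynchronize() );
t2 = clock();
cout << (float)(t2 - t1) / CLOCKS_PER_SEC << " seconds";

With the synchronization in place, a block size of 2 should be measurably slower than 1024 for this problem size. Alternatively, CUDA events (cudaEventRecord / cudaEventElapsedTime) can be used to time the kernel on the GPU itself.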
Upvotes: 1