Reputation: 25
In my code, I need to call a CUDA kernel to parallelize some matrix computation. However, this computation must be repeated ~60,000 times (the kernel is called inside a 60,000-iteration for loop).
That means that if I do cudaMalloc/cudaMemcpy around every single kernel call, most of the time will be spent on memory allocation and transfer, and I get a significant slowdown.
Is there a way to, say, allocate a piece of memory before the for loop, use that memory in each iteration of the kernel, and then, after the for loop, copy that memory back from device to host?
Thanks.
Upvotes: 0
Views: 121
Reputation: 151799
Yes, you can do exactly what you describe:
int *h_data, *d_data;
cudaMalloc((void **)&d_data, DSIZE*sizeof(int));   // allocate device memory once, before the loop
h_data = (int *)malloc(DSIZE*sizeof(int));
// fill up h_data[] with data
cudaMemcpy(d_data, h_data, DSIZE*sizeof(int), cudaMemcpyHostToDevice);  // copy to device once
for (int i = 0; i < 60000; i++)
  my_kernel<<<grid_dim, block_dim>>>(d_data);      // d_data persists across all launches
cudaMemcpy(h_data, d_data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);  // copy back once, after the loop
...
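For reference, a minimal self-contained version of this pattern might look like the sketch below. The kernel add_one, the size DSIZE, and the launch configuration are made up for illustration; real code should also check the return status of each CUDA API call:

#include <cstdio>
#include <cstdlib>

#define DSIZE 1024

// hypothetical kernel: adds 1 to every element on each launch
__global__ void add_one(int *data) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < DSIZE) data[idx] += 1;
}

int main() {
  int *h_data, *d_data;
  h_data = (int *)malloc(DSIZE*sizeof(int));
  for (int i = 0; i < DSIZE; i++) h_data[i] = 0;

  cudaMalloc((void **)&d_data, DSIZE*sizeof(int));                        // one allocation, up front
  cudaMemcpy(d_data, h_data, DSIZE*sizeof(int), cudaMemcpyHostToDevice);  // one copy to the device

  for (int i = 0; i < 60000; i++)                                         // no per-iteration malloc/memcpy
    add_one<<<(DSIZE+255)/256, 256>>>(d_data);

  cudaMemcpy(h_data, d_data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);  // one copy back to the host
  printf("h_data[0] = %d\n", h_data[0]);                                  // expect 60000

  cudaFree(d_data);
  free(h_data);
  return 0;
}

Since cudaMemcpy is a blocking call, the final copy implicitly waits for all queued kernel launches on the default stream to finish, so no explicit cudaDeviceSynchronize() is needed here.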
Upvotes: 2