Reputation: 69
I would like some help with the following.
I have a kernel that takes arrays as inputs, and after the calculations one of those arrays, which I'll call dev_array, ends up with new values.
I want to feed dev_array back into the kernel's calculations about 80 times, so that I end up with the correct final dev_array, which is used later in my main.
How can I do that? I have already tried a loop inside the kernel, wrapped around the thread-index condition:
while (i < 80) {
    i++;
    if (tidx < N) {
        // calculate dev_array, then use it again at the beginning of the while loop
    }
}
but it didn't work; it looked like it was stuck in an infinite loop.
Calling the kernel from main 80 times, with a memcpy before and after each launch, is not satisfactory either:
while (i < 80) {
    i++;
    cudaMemcpy(dev_array, /* ... */ cudaMemcpyHostToDevice);   // copy input to the device
    kernel<<<grid, block>>>(dev_array);
    cudaMemcpy(/* ... */ dev_array, /* ... */ cudaMemcpyDeviceToHost); // copy result back
}
Thank you for any help
Upvotes: 0
Views: 306
Reputation: 1200
You don't have to memcpy all the time. After the kernel finishes, the data stays in device memory, and the dev_array pointer keeps pointing to it.
So you can simply launch the kernel in a loop from the host as many times as you want, or even pass the result on to a second kernel.
Kernels launched one after the other into the same CUDA stream are queued and executed in order, which gives you the synchronization you need between iterations. Here you can learn more about streams and how they work.
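A minimal, self-contained sketch of that host-side loop (the kernel body, N, host_array, grid and block are placeholders, not your actual code): one copy to the device before the loop, one copy back after it, and no memcpy in between.
#include <cstdio>
#include <cuda_runtime.h>

#define N 1024

__global__ void kernel(float *dev_array)
{
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (tidx < N)
        dev_array[tidx] *= 1.01f;   // placeholder for the real calculation
}

int main()
{
    float host_array[N];
    for (int j = 0; j < N; ++j) host_array[j] = 1.0f;

    float *dev_array;
    cudaMalloc(&dev_array, N * sizeof(float));
    cudaMemcpy(dev_array, host_array, N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(256);
    dim3 grid((N + block.x - 1) / block.x);

    // Launches in the same (default) stream are queued and run in order,
    // so iteration i+1 sees the dev_array values produced by iteration i.
    for (int i = 0; i < 80; ++i)
        kernel<<<grid, block>>>(dev_array);

    cudaMemcpy(host_array, dev_array, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_array);

    printf("host_array[0] after 80 iterations: %f\n", host_array[0]);
    return 0;
}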
If you can manage the synchronization inside your kernel and use a for loop there instead, it is faster only if, for example, you keep the data in __shared__ memory and avoid reading from and writing to the (slower) global memory on every iteration. But there is no barrier that spans all blocks, which you would need to avoid read/write conflicts between blocks; __syncthreads() only synchronizes the threads within a single block.
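For completeness, a sketch of what that in-kernel loop could look like, under the assumption that each block only touches its own chunk of dev_array (so a per-block barrier is enough); kernel_iterate and the update formula are placeholders, and it expects blockDim.x == 256.
__global__ void kernel_iterate(float *dev_array, int n)
{
    __shared__ float tile[256];                     // one element per thread of the block
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;

    if (tidx < n)
        tile[threadIdx.x] = dev_array[tidx];        // read global memory once
    __syncthreads();

    for (int i = 0; i < 80; ++i) {                  // for loop instead of the while loop
        if (tidx < n)
            tile[threadIdx.x] = 0.5f * tile[threadIdx.x] + 1.0f;  // placeholder update
        __syncthreads();                            // barrier for this block only
    }

    if (tidx < n)
        dev_array[tidx] = tile[threadIdx.x];        // write the final values back once
}
If blocks do need each other's results between iterations, the host-side loop of kernel launches above is the safe choice.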
Upvotes: 1