Reputation: 497
I am trying to "map" a few tasks to a CUDA GPU. There are n tasks to process (see the pseudo-code below).
    malloc a boolean array flag[n] and initialize it to false;
    for each work-group in parallel do
        while there are still unfinished tasks do
            Do something;
            for a few j_1, j_2, ..., j_m (j_i < k) do
                Wait until task j_i is finished;   [ while (!flag[j_i]) ; ]
                Do something;
            end for
            Do something;
            Mark task k finished;                  [ flag[k] = true; ]
        end while
    end for
For some reason, I have to use threads in different thread blocks.
The question is how to implement the "Wait until task j_i is finished" and "Mark task k finished" steps in CUDA. My implementation uses a boolean array as the flags: set a task's flag once the task is done, and read the flag to check whether a task is done.
But it only works on small cases; on one large case the GPU crashes for an unknown reason. Is there a better way to implement the Wait and Mark steps in CUDA?
This is basically a problem of inter-thread communication in CUDA.
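For reference, here is a minimal sketch of that flag-based approach (the kernel name and the one-task-per-block dependency pattern are hypothetical). The volatile qualifier forces each read of the flag to go to memory rather than to a cached value; without it the spin loop may never observe the update. As the answers below explain, this scheme can still hang when a waiting block is resident while the block it depends on has not been scheduled yet:

    __global__ void tasks(volatile bool *flag)
    {
        int k = blockIdx.x;        // hypothetically, one task per block
        int j = k - 1;             // hypothetical dependency: task k waits on task k-1

        // Do something;

        if (k > 0)
            while (!flag[j]) ;     // Wait until task j is finished.

        // Do something;

        __threadfence();           // make this task's writes visible to other blocks
        flag[k] = true;            // Mark task k finished.
    }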
Upvotes: 0
Views: 5249
Reputation: 544
I don't think you need to implement this in CUDA; everything can be implemented on the CPU, since you are waiting for a task to complete and then doing another task. If you do want to implement it in CUDA, you don't need to wait for all the flags to be true: you know that initially all the flags are false, so just run Do something in parallel across all the threads and set each flag to true.
Alternatively, take an int flag and keep adding 1 to it after finishing Do something, so that you can detect the change in the flag from before to after Do something runs.
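For what it's worth, a minimal sketch of that counter idea (the names are hypothetical): each thread bumps a global counter after finishing its Do something, so other code can observe progress as a change in the counter's value:

    __device__ unsigned int doneCount = 0;   // how many tasks have finished

    __global__ void worker(int numTasks)
    {
        int task = blockIdx.x * blockDim.x + threadIdx.x;
        if (task >= numTasks) return;

        // Do something for this task ...

        __threadfence();             // publish this task's results first
        atomicAdd(&doneCount, 1u);   // then bump the counter, so the change
                                     // is observable before vs. after the work
    }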
If I got your question wrong, please comment and I'll try to improve the answer.
Upvotes: 1
Reputation: 7255
This kind of problem is best solved by a slightly different approach:
Don't assign fixed tasks to your threads, forcing them to wait until their task becomes available (which isn't possible in CUDA anyway, since threads can't block).
Instead, keep a list of available tasks (using atomic operations) and have each thread grab a task from that list, as sketched below.
This is still tricky to implement, and it takes care to get the corner cases right, but at least it's possible.
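A minimal sketch of such a task list, with hypothetical names: a single global counter serves as the queue head, and each thread atomically claims the next unprocessed task until the list is exhausted:

    __device__ unsigned int nextTask = 0;    // queue head: next unclaimed task index

    __global__ void workQueue(const float *in, float *out, unsigned int numTasks)
    {
        while (true) {
            unsigned int task = atomicAdd(&nextTask, 1u);   // claim a task
            if (task >= numTasks)
                return;                                     // no tasks left
            out[task] = in[task] * 2.0f;                    // hypothetical per-task work
        }
    }

Note that nextTask has to be reset (e.g. with cudaMemcpyToSymbol) before each launch, and this sketch assumes independent tasks; handling the dependencies between tasks is where the corner cases come in.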
Upvotes: 1
Reputation: 823
What you're asking for is not easily done since thread blocks can be scheduled in any order, and there is no easy way to synchronize or communicate between them. From the CUDA Programming Guide:
For the parallel workloads, at points in the algorithm where parallelism is broken because some threads need to synchronize in order to share data with each other, there are two cases: Either these threads belong to the same block, in which case they should use __syncthreads() and share data through shared memory within the same kernel invocation, or they belong to different blocks, in which case they must share data through global memory using two separate kernel invocations, one for writing to and one for reading from global memory. The second case is much less optimal since it adds the overhead of extra kernel invocations and global memory traffic. Its occurrence should therefore be minimized by mapping the algorithm to the CUDA programming model in such a way that the computations that require inter-thread communication are performed within a single thread block as much as possible.
So if you can't fit all the communication you need within a thread block, you would need to have multiple kernel calls in order to accomplish what you want.
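To illustrate the two-kernel pattern (the kernel names and the work itself are made up), the kernel boundary acts as the grid-wide barrier: every global-memory write from the first launch is visible to the second:

    __global__ void producePhase(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = i * 2.0f;        // write phase: results land in global memory
    }

    __global__ void consumePhase(const float *data, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = data[i] + 1.0f;   // read phase: sees all of producePhase's writes
    }

    // Host side: launches on the same stream run in order, so the second
    // kernel behaves as if a global barrier sat between the two:
    //   producePhase<<<blocks, threads>>>(d_data, n);
    //   consumePhase<<<blocks, threads>>>(d_data, d_out, n);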
I don't believe there is any difference with OpenCL, but I also don't work in OpenCL.
Upvotes: 1
Reputation: 21108
Synchronising within a thread block is straightforward using __syncthreads(). However, synchronising between thread blocks is trickier: the programming-model method is to break the work into two kernels.
If you think about it, it makes sense. The execution model (for both CUDA and OpenCL) is that a whole bunch of blocks execute on the processing units, but it says nothing about when. This means that some blocks will be executing while others will not (they'll be waiting). So if you had a __syncblocks() you would risk deadlock: the blocks already executing would stop at the barrier, but the blocks not yet executing would never reach it.
You can share information between blocks (using global memory and atomics, for example), but you can't get global synchronisation.
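As one concrete example of inter-block communication without a global barrier (along the lines of the threadFenceReduction CUDA sample; the names here are hypothetical), each block publishes a partial sum to global memory, and whichever block finishes last combines them:

    __device__ unsigned int blocksDone = 0;

    __global__ void sumAll(const float *in, float *partial, float *total, int n)
    {
        extern __shared__ float s[];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        s[tid]  = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // In-block tree reduction in shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) s[tid] += s[tid + stride];
            __syncthreads();
        }

        if (tid == 0) {
            partial[blockIdx.x] = s[0];
            __threadfence();           // publish this block's partial sum
            // atomicAdd returns the old count; the block that increments it
            // last knows every other partial sum is already visible.
            if (atomicAdd(&blocksDone, 1u) == gridDim.x - 1) {
                float t = 0.0f;
                for (int b = 0; b < gridDim.x; ++b)
                    t += partial[b];
                *total = t;
            }
        }
    }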
Depending on what you're trying to do, there is frequently another way of solving or breaking down the problem.
Upvotes: 2