Reputation: 183
Is it a bad idea to put a for loop in a kernel, or is it a common thing to do?
Upvotes: 11
Views: 19078
Reputation: 28292
It's common to put loops in kernels. That doesn't mean it's always a good idea, but it doesn't mean it's a bad one, either.
The general problem of deciding how to distribute your tasks and data and exploit the available parallelism effectively is a hard, unsolved one, especially when it comes to CUDA. Active research is being carried out to determine efficiently (i.e., without blindly exploring the parameter space) how to achieve the best results for a given kernel.
Sometimes, it can make a lot of sense to put loops in kernels. For instance, iterative computations over many elements of a large, regular data structure with strong data independence are ideally suited to kernels containing loops. Other times, you may decide to have each thread process many data points, e.g. if allocating one thread per task would require more shared memory than is available (this isn't uncommon when a large number of threads share a large amount of data; by increasing the amount of work done per thread, you can fit all of the threads' shared data into shared memory).
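A minimal sketch of the "many elements per thread" idea is the grid-stride loop. Everything here (kernel name, array names, launch configuration) is illustrative, not something from the question:

    #include <cuda_runtime.h>

    // Grid-stride loop: each thread starts at its global index and
    // strides by the total number of threads in the grid, so one
    // modest launch covers an array of any size.
    __global__ void scale(float *a, int n, float factor)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
            a[i] *= factor;   // iterations are independent, no sync needed
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_a;
        cudaMalloc(&d_a, n * sizeof(float));
        // ... fill d_a with data ...
        scale<<<256, 256>>>(d_a, n, 2.0f);  // 65536 threads loop over 1M elements
        cudaDeviceSynchronize();
        cudaFree(d_a);
        return 0;
    }

A nice property of this pattern is that the launch configuration is decoupled from the problem size, so you can tune block and grid dimensions for occupancy without touching the kernel logic.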
Your best bet is to make an educated guess, test, profile, and revise as needed. There's a lot of room to play around with optimizations: launch parameters, global vs. constant vs. shared memory, keeping register usage low, ensuring coalescing and avoiding shared memory bank conflicts, etc. If you're interested in performance, you should check out the "CUDA C Best Practices Guide" and the "CUDA Occupancy Calculator" available from NVIDIA on the CUDA 4.0 documentation page (if you haven't already).
Upvotes: 9
Reputation: 6675
As long as the loop isn't at the top level (i.e., a single thread serially iterating over all of your data), you should probably be OK. A loop like that would negate the parallelism that is the whole point of CUDA.
As Dan points out, memory access becomes an issue. One way around this is to load the referenced memory into shared memory, or into texture memory if it doesn't fit in shared. The reason is that uncoalesced global memory accesses are very slow (roughly 400 clock cycles, versus about 40 for shared memory).
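As a minimal sketch of the shared-memory approach, assume a small lookup table that every thread reads repeatedly at effectively random offsets (the table size, kernel name, and access pattern are all assumptions for illustration):

    #define TABLE_SIZE 256

    // Each block copies the heavily reused table from global memory into
    // shared memory once, with coalesced reads; the random lookups then
    // hit fast shared memory instead of global memory.
    __global__ void lookup(const float *table, const int *keys,
                           float *out, int n)
    {
        __shared__ float s_table[TABLE_SIZE];

        for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
            s_table[i] = table[i];
        __syncthreads();  // everyone waits until the table is fully loaded

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = s_table[keys[idx] % TABLE_SIZE];
    }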
Upvotes: 0
Reputation: 13383
It's generally okay if you're careful about your memory access patterns. If the for loop accesses memory at random, leading to many uncoalesced reads, it can be very slow.
In fact, I once had a piece of code run slower with CUDA because I naively stuck a for loop in the kernel. However, once I thought about memory access, for example by loading a chunk at a time into shared memory so each thread block could work through its part of the for loop from shared memory, it was much quicker. Roughly the tiling pattern sketched below.
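A hedged sketch of what I mean (the computation, names, and tile size are made up for illustration): each thread has to loop over all of `b`, so the block stages one chunk of `b` in shared memory at a time and the inner loop reads from there:

    #define TILE 128

    // For each element of a, accumulate over every element of b.
    // b is consumed chunk by chunk: the block cooperatively loads a
    // TILE-sized piece into shared memory, all threads loop over it,
    // then the next chunk is loaded.
    __global__ void sum_abs_diff(const float *a, const float *b,
                                 float *out, int na, int nb)
    {
        __shared__ float s_b[TILE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;

        for (int base = 0; base < nb; base += TILE) {
            for (int t = threadIdx.x; t < TILE; t += blockDim.x)
                s_b[t] = (base + t < nb) ? b[base + t] : 0.0f;  // coalesced load
            __syncthreads();  // chunk fully loaded before anyone reads it

            int limit = min(TILE, nb - base);
            if (i < na)
                for (int k = 0; k < limit; ++k)
                    acc += fabsf(a[i] - s_b[k]);  // reads hit shared memory
            __syncthreads();  // don't overwrite s_b while others still read it
        }
        if (i < na)
            out[i] = acc;
    }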
Upvotes: 4