Reputation: 3517
I would like to understand how to convert basic C/C++ loops into a CUDA kernel. Let's keep it simple:

for (int i = 0; i < MAXi; i++)
    for (int j = 0; j < MAXj; j++) {
        ...code that uses i and j...
    }

Every single i needs to compute MAXj elements. This may be very basic for some people, but I am really struggling here. Let's say that MAXj is around a million (MAXj = 1000000), and that is where we want all the threads to work. I have been successful with only the inner loop:
int tid=threadIdx.x + blockDim.x*blockIdx.x + blockDim.x*gridDim.x*blockIdx.y;
Using 2D blocks, how can I parallelize these kinds of loops? They are very common in C, and it would be very useful to learn how to do this.
Upvotes: 2
Views: 5418
Reputation: 6584
One good way to divide these kinds of 2D loops is to use 1D blocks and a 1D grid:

dim3 blocks(MAXj, 1);
dim3 grids(MAXi, 1);
kernel<<<grids, blocks>>>();

__global__ void kernel()
{
int i = blockIdx.x;
int j = threadIdx.x;
...code that uses i and j....
}
The inner loop is divided across threads and the outer loop across blocks (1D blocks in a 1D grid).
Note that a block cannot hold more than the hardware's maximum number of threads (1024 on current GPUs, 512 on older ones), so if MAXj and MAXi are very large values, you need to divide the work into smaller groups and compute them in pieces. The code is quite similar to the one posted in this thread.
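One common way to do that splitting is a grid-stride loop in both dimensions, so a fixed-size launch covers any MAXi x MAXj range. The following is only a sketch: the kernel name `kernel2d` and the block/grid sizes are illustrative assumptions, not something from the original post.

```cuda
// Sketch (assumed names/sizes): each thread starts at its global
// (i, j) coordinate and then strides by the total grid extent in
// each dimension, so the launch need not match MAXi x MAXj exactly.
__global__ void kernel2d(int MAXi, int MAXj /*, data pointers... */)
{
    for (int i = blockIdx.y * blockDim.y + threadIdx.y;
         i < MAXi;
         i += blockDim.y * gridDim.y)
    {
        for (int j = blockIdx.x * blockDim.x + threadIdx.x;
             j < MAXj;
             j += blockDim.x * gridDim.x)
        {
            // ...code that uses i and j...
        }
    }
}

// Example launch with a capped grid; every (i, j) pair is still
// visited because of the stride loops above:
// dim3 block(32, 8);   // 256 threads per block (within the 1024 limit)
// dim3 grid(64, 64);   // fixed grid, independent of MAXi and MAXj
// kernel2d<<<grid, block>>>(MAXi, MAXj);
```

This keeps the launch configuration legal even when MAXj is a million, at the cost of each thread iterating more than once.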
Upvotes: 3