Reputation: 7906
All this CUDA lark is head-melting in its power, but something I've been wondering about is the hard limits on 1D block/grid dimensions (usually 512 and 65535 respectively).
When dealing with problems that are much larger in scope (on the order of billions of elements), is there an automated, programmatic way of effectively setting up a 'queue' through a kernel, or is it a case of manual slicing and dicing?
How does everyone deal with problem-partitioning?
Upvotes: 2
Views: 1377
Reputation: 78478
There are two basic ways of partitioning your data so that you can work on it using CUDA:

1. Split the data into chunks, each sized to fit within the grid limits, and launch the kernel once per chunk.
2. Launch a single grid and have each thread loop over more than one data element.

I have explained these techniques with simple examples here. Method 2 is typically easier to code and work with for most tasks, as sketched below.
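A minimal sketch of method 2, assuming a simple element-wise operation (the kernel name and launch configuration are illustrative only): each thread strides through the array, so a grid of fixed, legal size can cover any number of elements.

__global__ void scale(float *data, size_t n, float factor)
{
    // One stride = total number of threads in the grid
    size_t stride = (size_t)blockDim.x * gridDim.x;

    // Each thread processes elements i, i + stride, i + 2*stride, ...
    for (size_t i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += stride)
        data[i] *= factor;
}

// e.g. scale<<<4096, 512>>>(d_data, n, 2.0f); -- any grid within the limits works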
Upvotes: 1
Reputation: 72349
If one dimensional grids are too small, just use two dimensional (or three dimensional on Fermi with CUDA 4.0) grids instead. Dimensionality in grid and block layouts is really only for convenience - it makes the execution space look like the sort of common data parallel input spaces programmers are used to working with (matrices, grids, voxels, etc). But it is only a very small abstraction away from the underlying simple linear numbering scheme, which can handle over 10^12 unique thread IDs within a single kernel launch.
In grids, ordering is column major, so if you had a 1D grid problem before, the "unique, 1D thread index" was calculated as:
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
which has a theoretical upper limit of 512 * 65535 = 33553920 unique threads. The equivalent 2D grid problem is only a simple extension of the 1D case:
size_t tidx = threadIdx.x + blockIdx.x * blockDim.x;
size_t tid = tidx + (size_t)blockIdx.y * blockDim.x * gridDim.x;
which has a theoretical upper limit of 512 * 65535 * 65535 = 2198956147200 unique threads. Fermi will let you add a third dimension to the grid, also of 65535 maximum size, which gives up to about 10^17 threads in a single execution grid. Which is rather a lot.
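As a sketch of how this fits together (the kernel, its name, and the sizing logic below are illustrative assumptions, not a prescribed recipe): the host code rounds the block count up to cover n elements, spills into the second grid dimension once the first would exceed 65535, and the kernel guards against the threads that fall past the end of the data.

__global__ void increment(float *data, size_t n)
{
    size_t tidx = threadIdx.x + blockIdx.x * blockDim.x;
    size_t tid  = tidx + (size_t)blockIdx.y * blockDim.x * gridDim.x;
    if (tid < n)            // the grid is rounded up, so trailing threads do nothing
        data[tid] += 1.0f;
}

void launch(float *d_data, size_t n)
{
    if (n == 0) return;
    const unsigned int block = 512;
    size_t blocks = (n + block - 1) / block;                    // blocks needed to cover n
    unsigned int gx = (unsigned int)(blocks < 65535 ? blocks : 65535);
    unsigned int gy = (unsigned int)((blocks + gx - 1) / gx);   // second dimension takes the overflow
    increment<<<dim3(gx, gy), block>>>(d_data, n);
}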
Upvotes: 1