Bolster

Reputation: 7906

CUDA: Method of partitioning *huge* problems?

All this CUDA lark is head-melting in its power, but something I've been wondering about is the hard limits on 1D block/grid dimensions (usually 512 threads per block and 65535 blocks per grid dimension).

When dealing with problems that are much larger in scope (on the order of billions of elements), is there an automated, programmatic way of effectively queueing work through a kernel, or is it a case of manual slicing and dicing?

How does everyone deal with problem-partitioning?

Upvotes: 2

Views: 1377

Answers (2)

Ashwin Nanjappa

Reputation: 78478

There are two basic ways of partitioning your data so that you can work on it using CUDA:

  1. Break data down into contiguous chunks, such that each thread works on one chunk.
  2. Each thread nibbles at one element of data. When all threads are done, they advance by numberOfThreads and repeat.

I have explained these techniques with simple examples here. Method 2 is typically easier to code and works well for most tasks.
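Method 2 is what is often called a grid-stride loop. A minimal sketch (the kernel name and the scaling operation are illustrative, not from the answer):

```cuda
// Method 2 sketch: each thread handles one element, then strides forward
// by the total number of threads in the grid until the array is covered.
__global__ void scale(float *data, float factor, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;   // total threads launched
    for (size_t i = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
         i < n;
         i += stride)
    {
        data[i] *= factor;                            // one element per iteration
    }
}

// Launch with any convenient grid size; correctness does not depend on
// n being a multiple of the total thread count:
// scale<<<256, 512>>>(d_data, 2.0f, n);
```

Because the loop bound is checked against n, the same kernel handles any problem size without the grid having to match it exactly.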

Upvotes: 1

talonmies

Reputation: 72349

If one dimensional grids are too small, just use two dimensional (or three dimensional on Fermi with CUDA 4.0) grids instead. Dimensionality in grid and block layouts is really only for convenience - it makes the execution space look like the sort of common data parallel input spaces programmers are used to working with (matrices, grids, voxels, etc). But it is only a very small abstraction away from the underlying simple linear numbering scheme, which can handle over 10^12 unique thread IDs within a single kernel launch.

In grids, ordering is column major, so if you had a 1D grid problem before, the "unique, 1D thread index" was calculated as:

unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

which has a theoretical upper limit of 512 * 65535 = 33553920 unique threads. The equivalent 2D grid problem is only a simple extension of the 1D case:

size_t tidx = threadIdx.x + blockIdx.x * blockDim.x;
size_t tid = tidx + blockIdx.y * blockDim.x * gridDim.x;

which has a theoretical upper limit of 512 * 65535 * 65535 = 2198956147200 unique threads. Fermi will let you add a third dimension to the grid, also of 65535 maximum size, which gives up to about 10^17 threads in a single execution grid. Which is rather a lot.
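As a sketch of how that 2D indexing gets used in practice, here is one way to cover a couple of billion elements (names and sizes are illustrative; the bounds check guards the excess threads in the rounded-up grid):

```cuda
__global__ void kernel2d(float *data, size_t n)
{
    // Same column-major linearisation as above.
    size_t tidx = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    size_t tid  = tidx + (size_t)blockIdx.y * blockDim.x * gridDim.x;
    if (tid < n)                // the grid is rounded up, so some threads
        data[tid] *= 2.0f;      // fall past the end of the data
}

// Host side: factor n into an (x, y) grid of 512-thread blocks.
size_t n = 2000000000;                                  // ~2 billion elements
unsigned int threads = 512;
size_t blocksNeeded  = (n + threads - 1) / threads;     // ceiling division
unsigned int gx = 65535;                                // fill x, spill into y
unsigned int gy = (unsigned int)((blocksNeeded + gx - 1) / gx);
dim3 grid(gx, gy);
// kernel2d<<<grid, threads>>>(d_data, n);
```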

Upvotes: 1
