ReeSSult

Reputation: 496

CUDA matrix preferred indexing method

Let's say that I have a 4x4 matrix, which is divided into a 2x2 grid of 2x2 blocks, so func<<<dim3(2,2), dim3(2,2)>>>(). The matrix is stored in a 1D array of size 16. The usual way to calculate x and y is the following:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

The recommended way to calculate the global index (at least judging by multiple examples) seems to be:

int index = y * width + x;

This would generate the following indices:

blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 0, index = 0
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 1, index = 1
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 0, index = 4
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 1, index = 5
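
To see what this formula does across the whole launch rather than just block (0,0), the arithmetic can be replayed on the host. The following is a minimal sketch in plain C++, not a kernel: `Dim2`, `global_index`, and `hits_per_element` are hypothetical names introduced here, and the parameters named `blockIdx`, `blockDim`, and `threadIdx` are ordinary function arguments standing in for CUDA's built-in variables.

```cpp
#include <vector>

// Hypothetical host-side stand-in for CUDA's dim3 / uint3 (x and y only).
struct Dim2 { int x, y; };

// Same arithmetic as in the question: row-major index over the whole matrix.
int global_index(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx, int width) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    return y * width + x;
}

// Count how many threads of the launch touch each element of the array.
std::vector<int> hits_per_element(Dim2 gridDim, Dim2 blockDim, int width) {
    std::vector<int> hits(width * (gridDim.y * blockDim.y), 0);
    for (int by = 0; by < gridDim.y; ++by)
        for (int bx = 0; bx < gridDim.x; ++bx)
            for (int ty = 0; ty < blockDim.y; ++ty)
                for (int tx = 0; tx < blockDim.x; ++tx)
                    ++hits[global_index({bx, by}, blockDim, {tx, ty}, width)];
    return hits;
}
```

Calling `hits_per_element({2, 2}, {2, 2}, 4)` for the launch described above should report exactly one thread per element, i.e. the formula maps the 16 threads onto the 16 array slots one-to-one.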

So each increment of y strides the index by width, which means that only threads adjacent in x access adjacent elements and benefit from coalescing. Another way to calculate the index is:

int index = y * blockDim.x + x;

Which would give the following indices:

blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 0, index = 0
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 1, index = 1
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 0, index = 2
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 1, index = 3

In this case, the entire block is coalesced as all threads would access consecutive elements of the array.

Why is the first method generally recommended? Doesn't the second one achieve better performance?

Upvotes: 0

Views: 383

Answers (1)

Robert Crovella

Reputation: 151799

Why is the first method generally recommended?

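One possibility might be that no one really thinks problems involving a 4x4 matrix accessed by a 4x4 arrangement of threads are useful to tune for. Once you get to large matrices broken into 32x32 tiles, this becomes moot, because each warp then covers one 32-element row of a tile and its accesses are to consecutive elements. (&)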

Another way to calculate the index is:

int index = y * blockDim.x + x;

I don't think so. One thread in your grid will have an (x,y) ordered pair of (0,1). Another will have an ordered pair of (2,0). Considering your proposed value for blockDim.x of 2, those two threads will yield the same index value (1 * 2 + 0 = 2 and 0 * 2 + 2 = 2). I don't imagine that is what you want.
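
The collision can be checked by evaluating the proposed formula for every global (x, y) pair in the 4x4 launch. This is a small host-side sketch in plain C++; `block_local_index` and `distinct_indices` are hypothetical helper names, not part of CUDA.

```cpp
#include <set>

// The index proposed in the question, with x and y being the *global*
// coordinates (blockIdx * blockDim + threadIdx) as defined there.
int block_local_index(int x, int y, int blockDimX) {
    return y * blockDimX + x;
}

// Number of distinct array elements reached by a width x height set of threads.
int distinct_indices(int width, int height, int blockDimX) {
    std::set<int> seen;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            seen.insert(block_local_index(x, y, blockDimX));
    return static_cast<int>(seen.size());
}
```

With `blockDimX = 2`, both `block_local_index(0, 1, 2)` and `block_local_index(2, 0, 2)` evaluate to 2, and `distinct_indices(4, 4, 2)` comes out as 10, so the 16 threads reach only 10 of the 16 elements and several of them overlap.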

(&) And with no loss of generality in my opinion, if I wanted to create threadblocks of fewer than 32x32 = 1024 threads, I would scale down the block y dimension, e.g. 32x16 for 512 threads, or 32x8 for 256 threads. This allows me to use the same indexing "everywhere".
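
Why keeping the block 32 wide matters can be sketched on the host as well. The sketch below assumes the documented CUDA rules that a warp has 32 threads and that warps are formed from consecutive linear thread IDs (threadIdx.x + threadIdx.y * blockDim.x); `warp_is_contiguous` is a hypothetical helper, and it looks only at block (0,0) with the y * width + x indexing.

```cpp
constexpr int kWarpSize = 32;  // warp size on current NVIDIA GPUs

// Returns true if the 32 lanes of the given warp in block (0,0) access
// 32 consecutive array elements when using index = y * width + x.
bool warp_is_contiguous(int blockDimX, int warp, int width) {
    int first = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        int tid = warp * kWarpSize + lane;  // linear thread ID within the block
        int tx = tid % blockDimX;           // threadIdx.x
        int ty = tid / blockDimX;           // threadIdx.y
        int index = ty * width + tx;
        if (lane == 0) {
            first = index;
        } else if (index != first + lane) {
            return false;                   // gap between neighbouring lanes
        }
    }
    return true;
}
```

For a matrix of width 1024, `warp_is_contiguous(32, w, 1024)` holds for any warp `w` regardless of the block's y dimension (32x16, 32x8, ...), whereas `warp_is_contiguous(16, 0, 1024)` does not, because a 16-wide block makes each warp straddle two matrix rows.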

Upvotes: 3
