Reputation: 496
Let's say that I have a 4x4 matrix, which is divided into a 2x2 grid of 2x2 blocks, so func<<<(2,2), (2,2)>>>(). The matrix is stored in a 1D array of size 16. The usual method to calculate x and y is the following:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
It seems like the recommended (at least by multiple examples) way to calculate the global index is:
int index = y * width + x;
This would generate the following indices:
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 0, index = 0
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 1, index = 1
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 0, index = 4
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 1, index = 5
So, on each y increment, the index would be strided, which means that only the x threads would benefit from coalescing. Another way to calculate the index is:
int index = y * blockDim.x + x;
Which would give the following indices:
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 0, index = 0
blockIdx.x,y = 0, threadIdx.y = 0, threadIdx.x = 1, index = 1
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 0, index = 2
blockIdx.x,y = 0, threadIdx.y = 1, threadIdx.x = 1, index = 3
In this case, the entire block is coalesced as all threads would access consecutive elements of the array.
Why is the first method generally recommended? Doesn't the second one achieve a better performance?
Upvotes: 0
Views: 383
Reputation: 151799
Why is the first method generally recommended?
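One possibility might be that no one really thinks problems involving a 4x4 matrix accessed by a 4x4 arrangement of threads are useful to tune for. Once you get to large matrices broken into 32x32 tiles, this becomes moot, because each warp then covers one 32-element row segment of the tile, and that segment is contiguous under y * width + x. (&)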
Another way to calculate the index is:
int index = y * blockDim.x + x;
I don't think so. One thread in your grid will have an (x,y) ordered pair of (0,1). Another will have an ordered pair of (2,0). Considering your proposed value for blockDim.x of 2, those two threads will yield the same index value (both compute 2). I don't imagine that is what you want.
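The collision can be checked by enumerating every thread of the launch. The following is a minimal host-side sketch in plain C++ that emulates func<<<(2,2), (2,2)>>>() without a GPU. Dim2, collectIndices and countDistinct are hypothetical helper names introduced only for this illustration.

```cpp
#include <set>
#include <vector>

// Hypothetical stand-in for the x/y parts of CUDA's dim3 (illustration only).
struct Dim2 { int x, y; };

// Emulates the launch on the host and returns the index computed by every thread.
// useWidth == true  -> index = y * width + x       (row-major over the whole matrix)
// useWidth == false -> index = y * blockDim.x + x  (the alternative from the question)
std::vector<int> collectIndices(Dim2 gridDim, Dim2 blockDim, bool useWidth) {
    const int width = gridDim.x * blockDim.x;  // 4 for the 4x4 example
    std::vector<int> indices;
    for (int by = 0; by < gridDim.y; ++by)
        for (int bx = 0; bx < gridDim.x; ++bx)
            for (int ty = 0; ty < blockDim.y; ++ty)
                for (int tx = 0; tx < blockDim.x; ++tx) {
                    const int x = bx * blockDim.x + tx;  // global column
                    const int y = by * blockDim.y + ty;  // global row
                    indices.push_back(useWidth ? y * width + x
                                               : y * blockDim.x + x);
                }
    return indices;
}

// Number of distinct array elements touched by the whole launch.
int countDistinct(const std::vector<int>& indices) {
    return static_cast<int>(std::set<int>(indices.begin(), indices.end()).size());
}
```

For the launch in the question, collectIndices({2, 2}, {2, 2}, true) produces 16 distinct indices, one per matrix element. With useWidth set to false only 10 distinct indices (0 through 9) come out, so several threads share an element and elements 10 through 15 are never touched.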
(&) And with no loss of generality, in my opinion: if I wanted to create threadblocks of fewer than 32x32 = 1024 threads, I would scale down the block y dimension, e.g. 32x16 for 512 threads, or 32x8 for 256 threads. This allows me to use the same indexing "everywhere".
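To see why keeping the block x dimension at 32 preserves coalescing, recall that threads of a block are grouped into warps of 32 in x-fastest order. The sketch below is a host-side emulation in plain C++, not a measurement on real hardware. It checks whether every warp of one block reads 32 consecutive elements under index = y * width + x. Dim2 and warpsReadConsecutive are hypothetical names for this illustration, and the width of 1024 used in the note is an arbitrary assumption.

```cpp
// Hypothetical stand-in for the x/y parts of CUDA's dim3 (illustration only).
struct Dim2 { int x, y; };

// Returns true if, within the given block, each warp's threads compute
// consecutive values of index = y * width + x.
bool warpsReadConsecutive(Dim2 blockDim, Dim2 blockIdx, int width) {
    const int warpSize = 32;
    const int threadsPerBlock = blockDim.x * blockDim.y;
    for (int warpStart = 0; warpStart < threadsPerBlock; warpStart += warpSize) {
        int previous = -1;
        for (int lane = 0; lane < warpSize && warpStart + lane < threadsPerBlock; ++lane) {
            // Threads are linearized with threadIdx.x varying fastest.
            const int linear = warpStart + lane;
            const int tx = linear % blockDim.x;
            const int ty = linear / blockDim.x;
            const int x = blockIdx.x * blockDim.x + tx;
            const int y = blockIdx.y * blockDim.y + ty;
            const int index = y * width + x;
            if (lane > 0 && index != previous + 1) return false;  // gap inside a warp
            previous = index;
        }
    }
    return true;
}
```

With width = 1024, block shapes 32x32, 32x16 and 32x8 all return true, since each warp is exactly one row of the block. A 16x16 block returns false because each warp spans two rows, and the 2x2 block from the question (width = 4) returns false for the same reason.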
Upvotes: 3