Reputation: 192
I have been trying to understand the thread and block indexing pattern in the simple matrix copy example at http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/
Why do we use TILE_DIM as the stride when calculating y, given that we know our block size is (TILE_DIM, BLOCK_ROWS)? Also, we are amortizing the index calculation by having each thread perform TILE_DIM / BLOCK_ROWS copies. I tried a configuration with threads per block (4,1) and blocks per grid (2,2), with a square matrix of width 8. I find that the offset values produced go beyond 15, which is above the matrix's linear (1D) dimensions. Kindly help with an example if possible. I would also like links to tutorials explaining matrix tiling with amortization in detail.
const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;

// Each 32x8 block copies one 32x32 tile, looping over the
// tile's rows in steps of BLOCK_ROWS.
__global__ void copy(float *odata, const float *idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;   // global column
    int y = blockIdx.y * TILE_DIM + threadIdx.y;   // global row (top of this thread's strip)
    int width = gridDim.x * TILE_DIM;              // matrix width in elements

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y+j)*width + x] = idata[(y+j)*width + x];
}
...
const int nx = 1024;
const int ny = 1024;

dim3 dimGrid(nx/TILE_DIM, ny/TILE_DIM, 1);   // one block per 32x32 tile
dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);      // 32x8 threads per block
....
copy<<<dimGrid, dimBlock>>>( ... );
Upvotes: 0
Views: 1838
Reputation: 21475
Why do we use TILE_DIM as a stride while calculating y since we know that our block size is TILE_DIM * BLOCK_ROWS?
According to your code, you are "tiling" your 1024x1024 matrix into 32x32 tiles of 32x32 elements each. This is done by defining the grid size as

dim3 dimGrid(nx/TILE_DIM, ny/TILE_DIM, 1);

You are also partitioning each tile into 4 subtiles of 8x32 elements each (BLOCK_ROWS rows by TILE_DIM columns). This is done by defining the block size as

dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);
Each block should copy an entire tile, as you noticed. Resorting to the definitions of x and y, the index (y+j)*width + x can be rewritten as

(y+j)*width + x = (blockIdx.y * TILE_DIM)*width + blockIdx.x * TILE_DIM +
                  (j+threadIdx.y)*width + threadIdx.x
The first term, (blockIdx.y * TILE_DIM)*width + blockIdx.x * TILE_DIM, identifies the matrix element with index (blockIdx.y * TILE_DIM, blockIdx.x * TILE_DIM), which in turn is the upper-left element of the tile identified by blockIdx.x and blockIdx.y. The second term, (j+threadIdx.y)*width + threadIdx.x, identifies the element with index (j+threadIdx.y, threadIdx.x) within the generic tile.
Now, consider that threadIdx.x ranges between 0 and TILE_DIM-1, while threadIdx.y ranges between 0 and BLOCK_ROWS-1. This explains why you need the for loop: so that j+threadIdx.y can span the whole interval (0, TILE_DIM-1).
Upvotes: 1