MuneshSingh

Reputation: 192

CUDA Matrix Copy

I have been trying to understand the thread- and block-indexing pattern in the simple matrix copy example at http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/

Why do we use TILE_DIM as the stride when calculating y, given that our block size is (TILE_DIM, BLOCK_ROWS)? We are also amortizing the calculation by having each thread do TILE_DIM / BLOCK_ROWS copies. I tried a configuration with threads per block of (4, 1), blocks per grid of (2, 2), and a square matrix of width 8. I find that the offset values produced go beyond 15, which is above the matrix's linear (1D) dimensions. Kindly help with an example if possible. I would also like links to tutorials where matrix tiling with amortization is explained in detail.

const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;

__global__ void copy(float *odata, const float *idata)
{
  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  int width = gridDim.x * TILE_DIM;

  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    odata[(y+j)*width + x] = idata[(y+j)*width + x];
}

...

const int nx = 1024;
const int ny = 1024;

dim3 dimGrid(nx/TILE_DIM, ny/TILE_DIM, 1);
dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);

....

copy<<<dimGrid, dimBlock>>>( ... );

Upvotes: 0

Views: 1838

Answers (1)

Vitality

Reputation: 21475

Why do we use TILE_DIM as a stride while calculating y, since we know that our block size is TILE_DIM * BLOCK_ROWS?

According to your code, you are "tiling" your 1024x1024 matrix into a 32x32 grid of tiles, each of 32x32 elements. This is done by defining the grid size as

dim3 dimGrid(nx/TILE_DIM, ny/TILE_DIM, 1);

Also, you are partitioning each tile into 4 sub-tiles of 8x32 elements each (BLOCK_ROWS rows by TILE_DIM columns). This is done by defining the block size as

dim3 dimBlock(TILE_DIM, BLOCK_ROWS, 1);

Each block copies an entire tile, as you noticed, so each of its TILE_DIM * BLOCK_ROWS = 256 threads performs TILE_DIM / BLOCK_ROWS = 4 copies. Using the definitions of x and y, rewrite the index (y+j)*width + x as

y*width + j*width + x =
    (blockIdx.y * TILE_DIM)*width + blockIdx.x * TILE_DIM +
    (j + threadIdx.y)*width + threadIdx.x

The first term, (blockIdx.y * TILE_DIM)*width + blockIdx.x * TILE_DIM, identifies the matrix element with index (blockIdx.y * TILE_DIM, blockIdx.x * TILE_DIM), which is the upper-left element of the tile identified by (blockIdx.y, blockIdx.x). The second term, (j + threadIdx.y)*width + threadIdx.x, identifies the element with index (j + threadIdx.y, threadIdx.x) within that tile. Now, threadIdx.x ranges between 0 and TILE_DIM - 1, while threadIdx.y only ranges between 0 and BLOCK_ROWS - 1. This is why you need the for loop: as j steps through 0, BLOCK_ROWS, 2*BLOCK_ROWS, ..., TILE_DIM - BLOCK_ROWS, the sum j + threadIdx.y spans the whole interval [0, TILE_DIM - 1], so each block covers all TILE_DIM rows of its tile.

Upvotes: 1
