Ignat Georgiev

Reputation: 11

Understanding CUDA indexing

I inherited some CUDA code that I need to work on, but some of the indexing in it is confusing me.

A simple example would be the normalisation of data. Say we have an array A[2*N], which is a matrix of shape 2xN that has been unrolled into a flat array. Then we have the normalisation means and standard deviations: norm_means[2] and norm_stds[2]. The goal is to normalise the data in A in parallel. A minimal example would be:

__global__ void normalise(float *data, float *norm, float *std) {
  // D (= 2) is assumed to be visible here at compile time, e.g. as a macro
  int tdy = threadIdx.y;

  for (int i = tdy; i < D; i += blockDim.y)
    data[i] = data[i] * norm[i] + std[i];
}

int main(int argc, char **argv) {

// generate data
int N=100;
int D=2;
MatrixXf A = MatrixXf::Random(N*D, 1);          // float matrices, to match the float* device buffers
MatrixXf norm_means = MatrixXf::Random(D, 1);
MatrixXf norm_stds = MatrixXf::Random(D, 1);

// transfer data to device
float* A_d;
float* norm_means_d;
float* norm_stds_d;
cudaMalloc((void **)&A_d, N * D * sizeof(float));
cudaMalloc((void **)&norm_means_d, D * sizeof(float));
cudaMalloc((void **)&norm_stds_d, D * sizeof(float));
cudaMemcpy(A_d, A.data(), D * N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(norm_means_d, norm_means.data(), D * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(norm_stds_d, norm_stds.data(), D * sizeof(float), cudaMemcpyHostToDevice);

// Setup execution
const int BLOCKSIZE_X = 8;
const int BLOCKSIZE_Y = 16;
const int GRIDSIZE_X = (N-1)/BLOCKSIZE_X + 1;
dim3 dimBlock(BLOCKSIZE_X, BLOCKSIZE_Y, 1);
dim3 dimGrid(GRIDSIZE_X, 1, 1);

normalise<<<dimGrid, dimBlock, 0>>>(A_d, norm_means_d, norm_stds_d);

}

Note that I am using Eigen for the matrix generation. I have omitted the includes for brevity.

The code above, through some magic, works and achieves the desired results. However, the CUDA kernel function does not make any sense to me, because the for loop should stop after one iteration, as i > D after the first increment... but it doesn't?

If I change the kernel to something that makes more sense to me, e.g.

__global__ void normalise(float *data, float *norm, float *std) {
  int tdy = threadIdx.y;

  for (int i = 0; i < D; i++)
    data[tdy + i*blockDim.y] = data[tdy + i*blockDim.y] * norm[i] + std[i];
}

the program stops working and just outputs gibberish data.

Can somebody explain why I get this behaviour?

P.S. I am very new to CUDA.

Upvotes: 0

Views: 83

Answers (1)

einpoklum

Reputation: 131547

It is indeed senseless to use a two-dimensional kernel to perform an elementwise operation on an array, and there is no particular reason to work in blocks of size 8x16. Your modified kernel indexes using the second dimension (y) only; that is probably why it doesn't work. You most likely wanted to index with the first dimension (x) only.
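For instance, a minimal sketch of such an x-only kernel could look like the following. The kernel name, the bounds check and the explicit N, D parameters are my own assumptions; the layout assumed is the same as in the kernel further below, with row d occupying elements [d*N, (d+1)*N) of the flattened array:

// One thread per element; the row index d is recovered from the flat index.
__global__ void normalise_1d(float* __restrict__ data,
                             const float* __restrict__ norm,
                             const float* __restrict__ std,
                             int N, int D)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N * D) {                 // guard against the last, partial block
        int d = i / N;               // which of the D rows this element is in
        data[i] = data[i] * norm[d] + std[d];
    }
}

which you would launch over the whole array, e.g. normalise_1d<<<(N*D + 255)/256, 256>>>(A_d, norm_means_d, norm_stds_d, N, D);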

However - it would be reasonable to use the Y dimension for the actual second dimension, e.g. something like this:

__global__ void normalize(
          float* __restrict__ data,
    const float* __restrict__ norm,
    const float* __restrict__ std)
{
    // N and D are assumed to be visible here at compile time (e.g. as
    // macros); otherwise, pass them as kernel parameters.
    auto pos = threadIdx.x + blockDim.x * blockIdx.x;
    auto d = threadIdx.y + blockDim.y * blockIdx.y; // or maybe just threadIdx.y
    if (pos < N && d < D) // guard against partial blocks at the grid edge
        data[pos + d * N] = data[pos + d * N] * norm[d] + std[d];
}
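For completeness, a matching launch configuration might look something like this (a sketch; the block width of 128 is an arbitrary choice, and N and D are the host-side values from the question):

// x covers the N columns, y covers the D rows. With gridDim.y == 1,
// d reduces to threadIdx.y, as the comment in the kernel notes.
const int BLOCKSIZE_X = 128;
dim3 dimBlock(BLOCKSIZE_X, D, 1);  // D == 2, so a single block spans both rows
dim3 dimGrid((N + BLOCKSIZE_X - 1) / BLOCKSIZE_X, 1, 1);
normalize<<<dimGrid, dimBlock>>>(A_d, norm_means_d, norm_stds_d);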

Other points to consider:

  • I added __restrict__ to your pointers. Always do that when the pointers cannot alias each other: it lets the compiler optimise the generated code more aggressively.
  • It's a good idea to have a single thread work on more than one element of data - but you should make that happen along the longer dimension, where the thread can reuse its norm and std values rather than read them from memory every time (see the sketch below).
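As a sketch of that last point (the kernel name and the explicit N parameter are hypothetical, and the data layout is the same row-contiguous one as above):

// Each thread strides along the long N dimension within a fixed row d,
// so norm[d] and std[d] are loaded once into registers and then reused.
__global__ void normalise_strided(float* __restrict__ data,
                                  const float* __restrict__ norm,
                                  const float* __restrict__ std,
                                  int N)
{
    int d = blockIdx.y;   // one grid row per matrix row (launch with gridDim.y == D)
    float m = norm[d];    // read once per thread...
    float s = std[d];     // ...instead of once per element
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < N;
         i += blockDim.x * gridDim.x)
        data[i + d * N] = data[i + d * N] * m + s;
}

This can be launched with a fixed, small number of x-blocks, e.g. dim3 grid(1, D); normalise_strided<<<grid, 128>>>(A_d, norm_means_d, norm_stds_d, N); so that each thread loops over several elements when N is large.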

Upvotes: 2
