Reputation: 11
I inherited some CUDA code that I need to work on but some of the indexing done in it is confusing me.
A simple example would be the normalisation of data. Say we have a shared array A[2*N]
which is a matrix of shape 2xN which has been unrolled to an array. Then we have the normalisation means and standard deviation: norm_means[2]
and norm_stds[2]
. The goal is to normalise the data in A in parallel. A minimal example would be:
__global__ void normalise(float *data, float *norm, float *std) {
int tdy = threadIdx.y;
for (int i=tdy; i<D; i+=blockDim.y)
data[i] = data[i] * norm[i] + std[i];
}
int main(int argc, char **argv) {
// generate data
int N=100;
int D=2;
MatrixXd A = MatrixXd::Random(N*D,1);
MatrixXd norm_means = MatrixXd::Random(D,1);
MatrixXd norm_stds = MatrixXd::Random(D,1);
// transfer data to device
float* A_d;
float* norm_means_d;
float* nrom_stds_d;
cudaMalloc((void **)&A_d, N * D * sizeof(float));
cudaMalloc((void **)&norm_means_d, D * sizeof(float));
cudaMalloc((void **)&norm_stds_d, D * sizeof(float));
cudaMemcpy(A_d, A.data(), D * N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(norm_means_d, norm_means.data(), D * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(norm_stds_d, norm_stds.data(), D * sizeof(float), cudaMemcpyHostToDevice);
// Setup execution
const int BLOCKSIZE_X = 8;
const int BLOCKSIZE_Y = 16;
const int GRIDSIZE_X = (N-1)/BLOCKSIZE_X + 1;
dim3 dimBlock(BLOCKSIZE_X, BLOCKSIZE_Y, 1);
dim3 dimGrid(GRIDSIZE_X, 1, 1);
normalise<<dimGrid, dimBlock, 0>>>(A_d, norm_means_d, norm_stds_d);
}
Note that I am using Eigen for the matrix generation. I have omitted the includes for brevity.
This code above through some magic works and achieves the desired results. However, the CUDA kernel function does not make any sense to me because the for loop should stop after one execution as i>D
after the first iteration .. but it doesn't?
If I change the kernel that makes more sense to me eg.
__global__ void normalise(float *data, float *norm, float *std) {
int tdy = threadIdx.y;
for (int i=0; i<D; i++)
data[tdy + i*blockDim.y] = data[tdy + i*blockDim.y] * norm[i] + std[i];
}
the program stops working and just outputs gibberish data.
Can somebody explain why I get this behaviour?
PS. I am very new to CUDA
Upvotes: 0
Views: 83
Reputation: 131547
It is indeed senseless to have a 2-dimensional kernel to perform an elementwise operation on an array. There is also no reason to work in blocks of size 8x16. But your modified kernel uses the second dimension (y) only; that's probably why it doesn't work. You probably needed to use the first dimension (x) only.
However - it would be reasonable to use the Y dimension for the actual second dimension, e.g. something like this:
__global__ void normalize(
float __restrict *data,
const float __restrict *norm,
const float __restrict *std)
{
auto pos = threadIdx.x + blockDim.x * blockIdx.x;
auto d = threadIdx.y + blockDim.y * blockIdx.y; // or maybe just threadIdx.y;
data[pos + d * N] = data[pos + d * N] * norm[d] + std[d];
}
Other points to consider:
__restrict
to your pointers. Always do that when relevant; here's why.Upvotes: 2