Reputation: 4336
I'm new to CUDA and have been reading tutorials and other open-source code to try to understand things. I know the general concept of thread hierarchies.
TL;DR: all the tutorials I have read assume that the data sent to the kernel is already organized in this hierarchy, without explicitly arranging it that way before launching the kernel. Shouldn't the data passed to the kernel be re-arranged into the grid > block > thread hierarchy before being passed to the kernel? Below are two snippets that confused me in this regard.
I followed this x_plus_y tutorial. In it, the following snippet appears:
__global__
void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global id
    int stride = blockDim.x * gridDim.x;                // total number of threads in the grid
    for (int i = index; i < n; i += stride)             // grid-stride loop
        y[i] = x[i] + y[i];
}
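For reference, the tutorial's host code sets this up roughly as follows (a minimal sketch from memory, using the tutorial's cudaMallocManaged approach; error checking omitted):

#include <cuda_runtime.h>

__global__ void add(int n, float *x, float *y);  // the kernel from the snippet above

int main() {
    int N = 1 << 20;
    float *x, *y;

    // x and y are two independent allocations: the kernel receives two separate
    // pointers, so index i counts from the start of each array individually.
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;  // enough blocks to cover all N elements
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}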
In the above snippet, we want to add corresponding elements of x and y, but how do I know that x and y are placed in the GPU such that the index i (which is computed from blockIdx, blockDim, etc.) actually points to corresponding elements of x and y? If x and y are placed one after the other in memory, shouldn't the index used for y take the length of x into account? I'm missing some key intuition here. Also, how do I know where in the GPU some random element of an array, say x[1011], has been mapped? Or do I not need to care about the explicit positioning of my data, due to some kind of abstraction?
I'll also discuss another snippet, from an open-source torch repo. It is a kernel for computing a distance metric between two sets of point clouds. Each cloud is an N x 3 matrix (it has N 3-D points).

b is the batch size (so b clouds are passed to the kernel)
n is the number of points in each cloud of the first set
m is the number of points in each cloud of the second set

For example, the first set of clouds can be (16, 1024, 3) and the second set (16, 512, 3):
__global__ void NmDistanceKernel(int b, int n, const float *xyz, int m, const float *xyz2, float *result, int *result_i){
    const int batch = 512;
    __shared__ float buf[batch * 3];                    // tile of xyz2 shared by the whole block
    for (int i = blockIdx.x; i < b; i += gridDim.x){    // blocks along x stride over the clouds
        for (int k2 = 0; k2 < m; k2 += batch){          // process xyz2 in tiles of `batch` points
            int end_k = min(m, k2 + batch) - k2;
            for (int j = threadIdx.x; j < end_k * 3; j += blockDim.x){
                buf[j] = xyz2[(i * m + k2) * 3 + j];    // cooperative load of the tile
            }
            __syncthreads();                            // wait until the tile is fully loaded
            for (int j = threadIdx.x + blockIdx.y * blockDim.x; j < n; j += blockDim.x * gridDim.y){
                float x1 = xyz[(i * n + j) * 3 + 0];
                float y1 = xyz[(i * n + j) * 3 + 1];
                float z1 = xyz[(i * n + j) * 3 + 2];
                // ... (distance computation against buf elided in this excerpt) ...
            }
            __syncthreads();                            // before the next tile overwrites buf
        }
    }
}
The above kernel is launched as follows:
NmDistanceKernel<<<dim3(32,16,1),512>>>(batch_size, n, xyz1.data<float>(), m, xyz2.data<float>(), dist1.data<float>(), idx1.data<int>());
Again, in the above kernel, the author has assumed that the data passed to the kernel is organized such that the indexing mechanism will work. They didn't explicitly place each point in a thread, a bunch of points inside a block, and a bunch of clouds inside a grid; this structure is nevertheless assumed inside the kernel.
Upvotes: 0
Views: 94
Reputation: 539
Before calling a kernel, you must have copied the data to the GPU.
Data is mostly passed in as arrays, so the structure of these arrays on the GPU is the same as it was in your host code.
In your first example, the arrays x and y are passed in separately, so the indices for both x and y start at 0. You could pass them in as one big array, and then the indexing would need to be adjusted.
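For example, a packed variant of the tutorial kernel could look like this; add_packed and the single-buffer layout are my own illustration, not from the tutorial:

// Hypothetical variant: one allocation holding x (the first n floats) then y (the next n).
__global__ void add_packed(int n, float *data)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        data[n + i] = data[i] + data[n + i];  // y[i] lives at data[n + i]
}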
That has been done in your other example. The array xyz consists of the x, y, and z values of all points. The order goes x1 y1 z1 x2 y2 z2 x3 y3 z3 ... That is why, when accessing the values, you see x = [...]+0; y = [...]+1; z = [...]+2;. For the next point, the indices all increase by 3.
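You can verify this layout on the host: flattening a (b, n, 3) array in row-major order lands element [i][j][k] exactly at the index the kernel computes. A tiny sketch with made-up sizes:

#include <cassert>

int main() {
    const int b = 2, n = 4;        // tiny made-up sizes
    float xyz[b][n][3];            // cloud i, point j, coordinate k
    float *flat = &xyz[0][0][0];   // what the kernel actually receives

    // Row-major flattening: element [i][j][k] sits at (i*n + j)*3 + k,
    // the expression the kernel uses for x (k=0), y (k=1), z (k=2).
    for (int i = 0; i < b; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < 3; k++)
                assert(&xyz[i][j][k] == &flat[(i * n + j) * 3 + k]);
    return 0;
}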
To then access your data in kernels, you refer to the identifiers that CUDA gives you: the position of the thread inside the grid and its block.
In the first example, the programmer chose to have the threads start off reading the first consecutive entries of the arrays. He does so by assigning a unique index to each thread:
int index = blockIdx.x * blockDim.x + threadIdx.x;
threadIdx.x tells us where the thread resides within its block, so it would be enough on its own if we were only launching one block. But then different threads in different blocks would have the same index, so we have to separate them by their blockIdx.x. A block is blockDim.x threads long, and the first thread of the second block should continue right after the last thread of the first block. That is how the above formula for index forms.
Then each thread jumps forward by the stride, so that the very first thread next reads the element right after the one the last thread just read, and so on.
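Concretely, with made-up numbers blockDim.x = 256 and gridDim.x = 8, thread 5 of block 2 gets index = 2*256 + 5 = 517 and stride = 256*8 = 2048. A quick host-side sketch of which elements that one thread touches:

#include <cstdio>

int main() {
    // Made-up launch geometry for illustration.
    int blockDimX = 256, gridDimX = 8;
    int blockIdxX = 2, threadIdxX = 5;  // one particular thread
    int n = 8192;

    int index  = blockIdxX * blockDimX + threadIdxX;  // 517
    int stride = blockDimX * gridDimX;                // 2048

    // This thread touches 517, 2565, 4613, 6661 -- interleaved with all
    // other threads so that consecutive threads read consecutive elements.
    for (int i = index; i < n; i += stride)
        printf("thread (%d,%d) handles element %d\n", blockIdxX, threadIdxX, i);
    return 0;
}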
The more dimensions your launched grid uses, the more complex these calculations have to be. Try starting out with simple grids and increase the complexity once you're comfortable with them.
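As an illustration of a multi-dimensional grid, here is a plain C++ sketch of how I read the dim3(32,16,1) launch in your second example (my interpretation, not code from that repo): the 32 blocks along x stride over the b clouds, the 16 blocks along y split each cloud's n points, and the 512 threads of a block work within one chunk.

#include <cstdio>

int main() {
    int b = 16, n = 1024;              // shapes from the question's example
    int gridDimX = 32, gridDimY = 16;  // dim3(32, 16, 1)
    int blockDimX = 512;               // threads per block

    // Follow only thread 0 of each block; the other 511 threads fill in
    // the j values in between, exactly as in the kernel's inner loop.
    for (int bx = 0; bx < gridDimX; bx++)          // blockIdx.x
        for (int i = bx; i < b; i += gridDimX)     // clouds handled by this block
            for (int by = 0; by < gridDimY; by++)  // blockIdx.y
                for (int j = by * blockDimX; j < n; j += blockDimX * gridDimY)
                    printf("block (%2d,%2d) covers cloud %2d starting at point %4d\n",
                           bx, by, i, j);
    return 0;
}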
Upvotes: 1