Link L

Reputation: 439

Tensorflow: what does index denote in CUDA_1D_KERNEL_LOOP(index, nthreads)

I have seen, in several standard TensorFlow ops (such as https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/maxpooling_op_gpu.cu.cc), the code CUDA_1D_KERNEL_LOOP(index, nthreads) as part of the forward and backward passes...

I think the "index" here is somehow related to the coordinates of the bottom feature map, but I am not sure of its exact meaning... Can anyone help?

Upvotes: 3

Views: 1449

Answers (2)

Edward Z. Yang

Reputation: 26742

One thing that puzzled me when I first read this macro was, "Why is it a loop? Isn't this inside the kernel, which has already been parallelized?" The answer is that the loop handles the case where you have more "threads" than the GPU can actually schedule at once.

For example, suppose you are doing a parallelized vector addition, and you have decided that for your GPU, you will be using 512 threads per block, scheduling a maximum of 4096 blocks (these are the default parameters in Caffe2). This means that you can only schedule a maximum of 2097152 threads. Suppose your vector actually has 4M elements; now you can't actually allocate a thread per element. So each thread must be responsible for summing more than one element in the vector: that is what this loop is for!
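As a concrete illustration, here is a minimal, self-contained sketch of that pattern in plain CUDA. The kernel name AddKernel and the host boilerplate are mine, not from TensorFlow or Caffe2; the launch shape of 4096 blocks of 512 threads matches the Caffe2 defaults mentioned above:

    #include <cstdio>

    // Grid-stride vector addition: each thread starts at its global id and
    // strides by the total number of launched threads, so the kernel stays
    // correct even when n exceeds blockDim.x * gridDim.x.
    __global__ void AddKernel(const float* a, const float* b, float* c, int n) {
      for (int index = blockIdx.x * blockDim.x + threadIdx.x;
           index < n;
           index += blockDim.x * gridDim.x) {
        c[index] = a[index] + b[index];  // one thread may cover several indices
      }
    }

    int main() {
      const int n = 1 << 22;  // 4M elements, more than the ~2M threads launched
      float *a, *b, *c;
      cudaMallocManaged(&a, n * sizeof(float));
      cudaMallocManaged(&b, n * sizeof(float));
      cudaMallocManaged(&c, n * sizeof(float));
      for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

      AddKernel<<<4096, 512>>>(a, b, c, n);  // 4096 * 512 = 2097152 threads
      cudaDeviceSynchronize();

      printf("c[0] = %f, c[n-1] = %f\n", c[0], c[n - 1]);  // both should be 3.0
      cudaFree(a); cudaFree(b); cudaFree(c);
      return 0;
    }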

Here is a smaller example that shows precisely how work ends up being scheduled. Suppose that blockDim.x == 2, gridDim.x == 2, and nthreads == 7. If we identify a GPU thread as (blockIdx.x, threadIdx.x), the elements of the vector are assigned as follows: [(0,0), (0,1), (1,0), (1,1), (0,0), (0,1), (1,0)]. In particular, we can see that according to the grid size there are only four GPU threads available, so the thread with blockIdx.x == 0 and threadIdx.x == 0 ends up with index taking BOTH the values 0 and 4.
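You can reproduce this assignment yourself with a small sketch (WhoDoesWhat is my name for it, not anything from TensorFlow) that launches the same shape and prints the mapping from elements to threads:

    #include <cstdio>

    // Each thread prints the element indices it covers under the
    // grid-stride loop, for gridDim.x == 2 and blockDim.x == 2.
    __global__ void WhoDoesWhat(int nthreads) {
      for (int index = blockIdx.x * blockDim.x + threadIdx.x;
           index < nthreads;
           index += blockDim.x * gridDim.x) {
        printf("element %d -> thread (%d, %d)\n",
               index, blockIdx.x, threadIdx.x);
      }
    }

    int main() {
      WhoDoesWhat<<<2, 2>>>(7);  // 2 blocks of 2 threads, 7 elements
      cudaDeviceSynchronize();   // wait so the device printf output flushes
      return 0;
    }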

Upvotes: 2

woodshop

Reputation: 116

CUDA_1D_KERNEL_LOOP(i, n) is a preprocessor macro defined in tensorflow/core/util/cuda_kernel_helper.h. It provides a generic control-flow statement that is used in many CUDA kernels within the TensorFlow codebase.

The statement is typically used for iterating through the elements of an array within a kernel. The argument i is the name of the loop control variable and the argument n is the stopping condition. CUDA kernels are launched across many parallel threads, and each thread typically operates on a subset of the array's elements; the macro is a convenience for computing which elements a given thread should access.

In the example that you link to, CUDA_1D_KERNEL_LOOP(index, nthreads) is interpreted as:

for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < nthreads; index += blockDim.x * gridDim.x)

Hence index is declared and initialized within CUDA_1D_KERNEL_LOOP before entering the subsequent code block. The precise meaning of index depends on how it is used within the code block.
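Putting that together, the macro's definition is essentially the following (this is a sketch consistent with the expansion above; I have not re-checked the exact current source), and a hypothetical kernel (ScaleKernel is my name, not a TensorFlow op) shows the typical usage:

    #define CUDA_1D_KERNEL_LOOP(i, n)                              \
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \
           i += blockDim.x * gridDim.x)

    // Hypothetical kernel: scales every element of `in` by `alpha`.
    // Whatever grid/block shape the launcher chooses, each thread
    // covers its share of the nthreads elements.
    __global__ void ScaleKernel(const int nthreads, const float alpha,
                                const float* in, float* out) {
      CUDA_1D_KERNEL_LOOP(index, nthreads) {
        // Here index is just a flat offset into the arrays. Kernels such as
        // the max-pooling one decode this flat index into (batch, channel,
        // height, width) coordinates of the feature map, which is why it can
        // look like a feature-map coordinate in the code the question links.
        out[index] = alpha * in[index];
      }
    }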

Upvotes: 6
