Reputation: 1296
I am trying to GPU-accelerate an algorithm in which I receive an asynchronous stream of particles in 3D space, $p=[x,y,t]$. Each vector $p_n$ needs to be multiplied by a set of transformation matrices. Since these transformations are independent of one another, they can happen in parallel, so I have written a CUDA kernel to do that. It works well, but of course for each incoming $p_n$ I end up launching the CUDA kernel anew. Launching a CUDA kernel carries a major time penalty, and so I lose the advantage of GPU acceleration. My question is: can I keep the kernel open and stream the particles to it somehow?
In case it's any help, here is my current kernel:
__global__
void project(float *projection_matrix, float *vector, float *output_matrix) {
    // col_index selects a transformation matrix, threadIdx.y selects its row.
    int col_index = blockIdx.x * blockDim.x + threadIdx.x;
    int output_index = col_index * 3 + threadIdx.y;
    // First element of the matrix row this thread multiplies
    // (matrices stored row-major, 9 floats apiece).
    int transform_first_element = col_index * 9 + threadIdx.y * 3;
    int stride = blockDim.x * blockDim.y * gridDim.x;
    // Grid-stride loop over all NUMBER_OF_TRANSFORMS * 3 output elements.
    while (output_index < (NUMBER_OF_TRANSFORMS * 3)) {
        output_matrix[output_index] = projection_matrix[transform_first_element] * vector[0]
                                    + projection_matrix[transform_first_element + 1] * vector[1]
                                    + projection_matrix[transform_first_element + 2] * vector[2];
        output_index += stride;
        transform_first_element += stride * 3;  // keep the matrix offset in step with output_index
    }
}
and this is where I call it:
...
project<<<num_blocks_dim, block_dim>>>(transformationList, inputVector, outputMatrix);
cudaDeviceSynchronize();
...
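For clarity, a single launch computes each row-major 3×3 matrix times the incoming vector. A plain C++ equivalent (function name is illustrative) looks like this:

```cpp
#include <array>
#include <vector>

// CPU reference for what one kernel launch computes: for each 3x3 matrix
// in `transforms` (row-major, 9 floats apiece), multiply it by `v` and
// append the resulting 3-vector to the output.
std::vector<float> project_cpu(const std::vector<float>& transforms,
                               const std::array<float, 3>& v) {
    const std::size_t num = transforms.size() / 9;
    std::vector<float> out(num * 3);
    for (std::size_t m = 0; m < num; ++m)
        for (std::size_t row = 0; row < 3; ++row)
            out[m * 3 + row] = transforms[m * 9 + row * 3 + 0] * v[0]
                             + transforms[m * 9 + row * 3 + 1] * v[1]
                             + transforms[m * 9 + row * 3 + 2] * v[2];
    return out;
}
```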
Upvotes: 1
Views: 821
Reputation: 3191
You'll need to batch the requests up into larger blocks and invoke the kernel on many particles at once. You can likely use the third grid dimension to iterate over them. One way to do this is to accumulate incoming particles while the kernel is running on the previous batch. If you do not get enough particles to justify a kernel launch, process them on the CPU instead.
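The accumulate-then-launch idea can be sketched as a small host-side buffer (class and function names here are illustrative, not from the question's code). In real code the flush callback would do one cudaMemcpyAsync plus one kernel launch for the whole batch:

```cpp
#include <cstddef>
#include <functional>
#include <array>
#include <vector>

// Buffers incoming particles; once `batch_size` have arrived, hands the
// whole batch to `flush` in a single call (one copy + one kernel launch).
class ParticleBatcher {
public:
    using Particle = std::array<float, 3>;
    using FlushFn  = std::function<void(const std::vector<Particle>&)>;

    ParticleBatcher(std::size_t batch_size, FlushFn flush)
        : batch_size_(batch_size), flush_(std::move(flush)) {}

    void push(const Particle& p) {
        pending_.push_back(p);
        if (pending_.size() >= batch_size_) {
            flush_(pending_);   // one launch for the whole batch
            pending_.clear();
        }
    }

    // Call on shutdown or a timeout: a small leftover batch may be
    // cheaper to process on the CPU than to launch a kernel for.
    void drain() {
        if (!pending_.empty()) {
            flush_(pending_);
            pending_.clear();
        }
    }

private:
    std::size_t batch_size_;
    FlushFn flush_;
    std::vector<Particle> pending_;
};
```

The flush callback is also the natural place to fall back to a CPU path when the batch is below some threshold.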
If the particles are being produced on the GPU, newer versions of CUDA give you the option of launching a kernel from within a kernel (dynamic parallelism), but you still need a pretty large batch to make that a win.
If these are coming from the CPU and then going back to the CPU, I'd be surprised if you can make it pay off at all unless the number of matrices is pretty large (comparing against well-optimized SIMD CPU code).
Upvotes: 1