Water

Reputation: 3705

Using dynamic parallelism results in 30x worse performance

Note: I don't have my computer and GPU with me, so this is me typing from memory. I did time this and compile it correctly, so please ignore any odd typos should they exist.

I don't know whether the overhead of what I describe below is the problem, or whether I'm doing this wrong, but launching kernels from kernels ends up slower than one big kernel in which many threads are simply predicated off and go unused. Maybe it's because I'm not swamping the GPU with enough work to notice any saturation.

Suppose we're doing something simple for the sake of this example, like multiplying all the values in a square matrix by two. The matrices can be any size, but they won't be larger than 16x16.

Now suppose I have 200 matrices in device memory, ready to go. I launch a kernel like this:

// One matrix given to each block
__global__ void matrixFunc(Matrix** matrices)
{
    Matrix* m = matrices[blockIdx.x];
    int area = m->width * m->height;
    if (threadIdx.x < area)
        m->data[threadIdx.x] *= 2.0f; // Heavy calculations (stand-in: multiply by two)
}

// Assume 200 matrices, no larger than 16x16
matrixFunc<<<200, 256>>>(ptrs);

whereby I'm using one block per matrix, and an abundance of threads so that I know I'll never have fewer threads per block than cells in a matrix.

The above runs in 0.17 microseconds.

This seems wasteful. I know I have a bunch of small matrices (256 threads is overkill when a 2x2 matrix only needs 4), so why not launch them dynamically from a kernel and see what the runtime overhead is? (For learning reasons.)

I change my code to be like the following:

__global__ void matrixFunc(float* matrix) // must be __global__ to be launched from the device
{
    // Heavy calculations (threadIdx.x selects the cell)
    matrix[threadIdx.x] *= 2.0f;
}

__global__ void matrixFuncCaller(Matrix** matrices)
{
    Matrix* m = matrices[threadIdx.x];
    int area = m->width * m->height;
    matrixFunc<<<1, area>>>(m->data);
}

matrixFuncCaller<<<1, 200>>>(ptrs);

But this performs a lot worse at 11.3 microseconds.

I realize I could put each launch on its own stream, so I change the caller to create a new stream per launch:

__global__ void matrixFuncCaller(Matrix** matrices)
{
    Matrix* m = matrices[threadIdx.x];
    int area = m->width * m->height;

    // Create `stream` (device-side launches require a non-blocking stream)
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    matrixFunc<<<1, area, 0, stream>>>(m->data);

    // Destroy `stream`
    cudaStreamDestroy(stream);
}

This does better: it's now 3 microseconds instead of 11, but that's still much worse than 0.17 microseconds.

I want to know why this is worse.

Is this kernel-launching overhead? I figure my example is small enough that the overhead drowns out the actual work. In my real application, which I cannot post, a lot more work is done than just "2 * matrix", but it is probably still small enough that the overhead could matter.

Am I doing anything wrong?

Upvotes: 0

Views: 1184

Answers (1)

Jérôme Richard

Reputation: 50846

To put it shortly: the benchmark is almost certainly biased, and the computation is latency-bound.

I do not know how you measured the timings, but I do not believe "0.17 microseconds" is even possible. The overhead of launching a kernel is typically a few microseconds (I have never seen an overhead smaller than 1 microsecond). Indeed, launching a kernel typically requires a system call, which is expensive and known to cost at least about 1000 cycles. An example of overhead analysis can be found in this research paper (confirming that it should take several microseconds). Not to mention that RAM accesses take at least 50-100 ns on mainstream x86-64 platforms, and that a GPU global memory access takes several hundred cycles. While it is possible that everything fits in both the CPU and GPU caches, this is very unlikely to be the case for the kernels (especially since the GPU may be used for other tasks between kernel executions). For more information about this, please read this research paper. Thus, what you measured almost certainly has nothing to do with the kernel execution. To time the kernel properly, you need to take care of synchronization (e.g. call cudaDeviceSynchronize), since kernels are launched asynchronously.
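For example, a minimal host-side timing sketch could look like the following (it reuses matrixFunc and ptrs from the question; everything else is only an illustration of how the measurement could be structured):

#include <chrono>
#include <cstdio>

// Measure the wall-clock time of the launch *and* the execution.
// Without the cudaDeviceSynchronize() call, only the asynchronous
// launch itself would be measured, not the kernel execution.
auto start = std::chrono::high_resolution_clock::now();

matrixFunc<<<200, 256>>>(ptrs);
cudaDeviceSynchronize(); // wait for the kernel to actually finish

auto stop = std::chrono::high_resolution_clock::now();
double us = std::chrono::duration<double, std::micro>(stop - start).count();
printf("kernel time: %f us\n", us);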

When multiple kernels are launched, you may pay the overhead of an implicit synchronization, since the launch queue is certainly bounded (for the sake of performance). In fact, as pointed out by @talonmies in the comments, the number of concurrent kernels is limited to 16-128 (so fewer than the number of matrices).

Using multiple streams reduces the need for synchronization, hence the better performance, but there is certainly still a synchronization somewhere. That being said, for the comparison to be fair, you need to add a synchronization in all cases, or measure the execution time on the GPU itself (without the launch overhead), again in all cases.
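For instance, CUDA events measure the time spent on the GPU itself and mostly exclude the host-side launch overhead; a sketch (again reusing the names from the question) could look like this:

cudaEvent_t beg, end;
cudaEventCreate(&beg);
cudaEventCreate(&end);

cudaEventRecord(beg);
matrixFuncCaller<<<1, 200>>>(ptrs); // or matrixFunc<<<200, 256>>>(ptrs)
cudaEventRecord(end);
cudaEventSynchronize(end); // wait until the end event has been reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, beg, end); // elapsed time in milliseconds
printf("GPU time: %f us\n", ms * 1000.0f);

cudaEventDestroy(beg);
cudaEventDestroy(end);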

Profilers like nvvp help a lot in understanding what is going on in such cases. I strongly advise you to use them.

As for the computation, please note that GPUs are designed for heavy, SIMT-friendly computational kernels, not for low-latency kernels operating on small, variable-sized matrices stored at unpredictable memory locations. In fact, the overhead of a global memory access is so big that it should dwarf the actual matrix computation. If you want GPUs to be useful, you need to submit more work to them (so as to provide more parallelism and overlap the high latencies). If you cannot provide more work, then the latency cannot be hidden, and if you care about microsecond-level latencies, then GPUs are clearly not suited to the task.

By the way, note that Nvidia GPUs operate on warps of typically 32 threads. Threads should perform coalesced memory loads/stores to be efficient (otherwise they are split into many load/store requests). Operating on very small matrices like this likely prevents that, not to mention that most threads will do nothing. Flattening the matrices and sorting them by size, as proposed by @sebastian in the comments, helps a bit, but the computation and memory accesses will still be very inefficient for a GPU (not SIMT-friendly). Note that using fewer threads and making use of unrolling should also be a bit more efficient (but still far from great). CPUs are better suited for such a task (thanks to higher frequencies and instruction-level parallelism combined with out-of-order execution). For fast, low-latency kernels like this, FPGAs can be even better suited (though they are hard to program).
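To give a rough idea of what the flattening could look like, here is a sketch assuming all matrix cells are packed back-to-back into one contiguous device buffer (the layout and the names are hypothetical, not part of your code):

// All matrix cells packed contiguously in `cells`. For a purely per-cell
// operation only the total cell count is needed; per-matrix offsets would
// be required if the computation depended on which matrix a cell belongs to.
__global__ void matrixFuncFlat(float* cells, int totalCells)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Consecutive threads access consecutive addresses, so the
    // loads/stores can be coalesced into few memory transactions.
    if (i < totalCells)
        cells[i] *= 2.0f; // stand-in for the heavier per-cell work
}

// Host side: one flat launch covering all 200 matrices.
// int threads = 256;
// int blocks = (totalCells + threads - 1) / threads;
// matrixFuncFlat<<<blocks, threads>>>(d_cells, totalCells);

This still will not make such a tiny workload a good fit for a GPU, but it avoids the pointer-chasing and the mostly-idle threads of the per-matrix launches.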

Upvotes: 4
