Why is the first cudaMalloc the only bottleneck?

Question

I defined this function:

void cuda_entering_function(...)
{
    StructA *host_input, *dev_input;
    StructB *host_output, *dev_output;

    host_input = (StructA*)malloc(sizeof(StructA));
    host_output = (StructB*)malloc(sizeof(StructB));
    cudaMalloc(&dev_input, sizeof(StructA));
    cudaMalloc(&dev_output, sizeof(StructB));

    ... some more other cudaMalloc()s and cudaMemcpy()s ...

    cudaKernel<<< ... >>>(dev_input, dev_output);

    ...
}

This function is called several times (about 5 to 15 times) throughout my program, and I measured the program's performance using gettimeofday().

I found that the bottleneck of cuda_entering_function() is the first cudaMalloc(), that is, the very first cudaMalloc() in my whole program. Over 95% of the total execution time of cuda_entering_function() was consumed by that call, and this still happens when I change the size of the first allocation or the order of the cudaMalloc() calls.

What is the reason, and is there any way to reduce the time taken by the first CUDA allocation?

Etienne Pellegrini · Accepted Answer

The first cudaMalloc also initializes the device, because it is the first call to any function that involves the device. That is why you take such a hit: the time is one-time overhead from setting up CUDA and your GPU, not the cost of the allocation itself. You should make sure that your application gains enough speedup to compensate for this overhead.

In general, people call an initialization function to set up their device. In this answer, you can see that a call to cudaFree(0) is apparently the canonical way to do so. This sample shows the use of cudaSetDevice, which is a good habit if you ever work on machines with several CUDA-capable devices.
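
To see the effect for yourself, here is a minimal sketch that pays the initialization cost up front and then times the first real cudaMalloc separately. The helper names warm_up_device and elapsed_ms are hypothetical, device index 0 and the 1024-byte allocation are assumptions, and host-side std::chrono timing stands in for the gettimeofday() used in the question. Depending on the CUDA version, part of the setup cost may land in cudaSetDevice rather than cudaFree(0), so the sketch times both together.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: milliseconds elapsed between two time points.
static double elapsed_ms(Clock::time_point a, Clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
}

// Hypothetical helper: select a device and force the lazy CUDA
// initialization to happen now instead of inside the first cudaMalloc.
static cudaError_t warm_up_device(int device) {
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess) return err;
    return cudaFree(0);  // frees nothing; only triggers initialization
}

int main() {
    // Time the explicit warm-up (device 0 is an assumption).
    Clock::time_point t0 = Clock::now();
    cudaError_t err = warm_up_device(0);
    Clock::time_point t1 = Clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Time the first real allocation, which no longer has to initialize CUDA.
    void *buf = nullptr;
    Clock::time_point t2 = Clock::now();
    err = cudaMalloc(&buf, 1024);
    Clock::time_point t3 = Clock::now();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("warm-up:          %.3f ms\n", elapsed_ms(t0, t1));
    std::printf("first cudaMalloc: %.3f ms\n", elapsed_ms(t2, t3));

    cudaFree(buf);
    return 0;
}
```

Note that this does not remove the overhead; it only moves it to a point you choose, such as program startup, so it no longer shows up inside cuda_entering_function().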
