Reputation: 99
I defined this function:
    void cuda_entering_function(...)
    {
        StructA *host_input, *dev_input;
        StructB *host_output, *dev_output;
        host_input = (StructA*)malloc(sizeof(StructA));
        host_output = (StructB*)malloc(sizeof(StructB));
        cudaMalloc(&dev_input, sizeof(StructA));
        cudaMalloc(&dev_output, sizeof(StructB));
        ... some more other cudaMalloc()s and cudaMemcpy()s ...
        cudaKernel<<< ... >>>(dev_input, dev_output);
        ...
    }
This function is called several times (about 5 to 15 times) throughout my program, and I measured the program's performance using gettimeofday().
Then I found that the bottleneck of cuda_entering_function() is the first cudaMalloc(), that is, the very first cudaMalloc() in my whole program. Over 95% of the total execution time of cuda_entering_function() was consumed by the first cudaMalloc(). This still happens when I change the size of the first cudaMalloc()'s allocation or when I change the order in which the cudaMalloc()s are executed.
What is the reason, and is there any way to reduce the time taken by the first CUDA allocation?
Upvotes: 3
Views: 639
Reputation: 748
The first cudaMalloc also has to initialize the device, because it is the first call in your program that touches the device. That is why you take such a hit: it is one-time overhead that comes with using CUDA and your GPU, not a cost of the allocation itself, which is consistent with the time staying the same when you change the size or the order. Make sure your application gains enough speedup to compensate for this overhead.
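To see this for yourself, a minimal sketch along these lines can help. The helper name timed_malloc_ms and the 1 MiB size are my own choices for illustration, not something from your code. It only times the cudaMalloc call itself, so calling it twice in a row should show the first call being far slower than the second.

```cuda
#include <chrono>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): returns the wall-clock
// time, in milliseconds, of a single cudaMalloc call of `bytes` bytes.
// The CUDA status of the allocation is written to *status.
static double timed_malloc_ms(size_t bytes, cudaError_t *status)
{
    void *ptr = nullptr;

    auto start = std::chrono::steady_clock::now();
    *status = cudaMalloc(&ptr, bytes);
    auto stop = std::chrono::steady_clock::now();

    // Release the buffer outside the timed region.
    if (*status == cudaSuccess) {
        cudaFree(ptr);
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling timed_malloc_ms(1 << 20, &status) twice at the start of main and printing both values should make the one-time cost visible. The exact numbers depend on the GPU, the driver, and the CUDA version.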
In general, people call an initialization function up front to set up the device. In this answer, you can see that a call to cudaFree(0) is apparently the canonical way to do so. This sample shows the use of cudaSetDevice, which is a good habit if you ever work on machines with several CUDA-capable devices.
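As a rough sketch of that idea, something like the following could be called once at program start, before the first cuda_entering_function(). The name warm_up_device is made up for this example, and picking device 0 is an assumption about your machine.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): selects a device and
// forces the one-time initialization to happen here instead of inside
// the first real cudaMalloc. Returns the first CUDA error encountered.
static cudaError_t warm_up_device(int device)
{
    // Select the GPU explicitly; useful on machines with several devices.
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess) {
        return err;
    }
    // Freeing a null pointer performs no deallocation, but as a runtime
    // call that needs the device it triggers the initialization.
    return cudaFree(0);
}
```

This does not make the initialization cheaper. It only moves the cost to a place you control, so that timings of cuda_entering_function() reflect the actual allocations, copies, and kernel.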
Upvotes: 9