Reputation: 153
I have installed the CUDA runtime and drivers, version 7.0, on my workstation (Ubuntu 14.04, 2x Intel Xeon E5 + 4x Tesla K20m). I've used the following program to check whether my installation works:
    #include <stdio.h>

    __global__ void helloFromGPU()
    {
        printf("Hello World from GPU!\n");
    }

    int main(int argc, char **argv)
    {
        printf("Hello World from CPU!\n");
        helloFromGPU<<<1, 1>>>();
        printf("Hello World from CPU! Again!\n");
        cudaDeviceSynchronize();
        printf("Hello World from CPU! Yet again!\n");
        return 0;
    }
I get the correct output, but it takes an enormous amount of time:
    $ nvcc hello.cu -O2
    $ time ./hello > /dev/null
    real 0m8.897s
    user 0m0.004s
    sys  0m1.017s
If I remove all device code, the overall execution takes 0.001s. So why does my simple program take almost 10 seconds?
Upvotes: 7
Views: 624
Reputation: 72348
The apparent slow runtime of your example is due to the underlying fixed cost of setting up the GPU context.
Because you are running on a platform that supports unified addressing, the CUDA runtime has to map your 64GB of host RAM and the 4 x 5120MB from your GPUs into a single virtual address space and register that address space with the Linux kernel.
There are a lot of kernel API calls required to do that, and it isn't fast. I would guess that is the main source of the slow performance you are observing. You should view this as a fixed start-up cost which must be amortised over the life of your application. In real-world applications, a 10-second startup is trivial and of no real importance. In a hello world example, it isn't.
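If you want to confirm this yourself, a common idiom is to force context creation with a no-op runtime call such as `cudaFree(0)` and time it separately from the kernel launch. This is only a minimal sketch of that measurement, not a precise profile; the timings you see will depend on your driver and hardware:

    #include <stdio.h>
    #include <time.h>
    #include <cuda_runtime.h>

    __global__ void helloFromGPU()
    {
        printf("Hello World from GPU!\n");
    }

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        // Force lazy context creation now, so its cost is measured alone.
        double t0 = seconds();
        cudaFree(0);
        double t1 = seconds();

        // With the context already established, the launch itself is cheap.
        helloFromGPU<<<1, 1>>>();
        cudaDeviceSynchronize();
        double t2 = seconds();

        printf("context setup: %.3f s\n", t1 - t0);
        printf("kernel launch + sync: %.3f s\n", t2 - t1);
        return 0;
    }

On a multi-GPU box like yours you can also shrink the amount of memory that has to be mapped by restricting which devices are visible, e.g. running with `CUDA_VISIBLE_DEVICES=0 ./hello`, which should reduce the setup time noticeably.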
Upvotes: 7