Max

Reputation: 101

CUDA memory allocation performance

I'm working with image filters on CUDA. The filtering itself is much faster than on the CPU, but the problem is that allocating and initializing the image on the device is really slow.

That is how I allocate memory and set the image.

hr = cudaMalloc(&m_device.originalImage, size);
hr = cudaMalloc(&m_device.modifiedImage, size);
hr = cudaMalloc(&m_device.tempImage,     size);
hr = cudaMemset(m_device.modifiedImage, 0, size);
hr = cudaMemcpy(m_device.originalImage, host.originalImage, size, cudaMemcpyHostToDevice);
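To see which of these calls dominates, a sketch (not the original code; `m_device`, `host`, `hr` and `size` are assumed from above) that times each step separately:

```cuda
#include <chrono>
#include <iostream>
#include <cuda_runtime.h>

using clk = std::chrono::high_resolution_clock;

auto report = [](const char* name, clk::time_point a, clk::time_point b) {
    std::cout << name << ": "
              << std::chrono::duration_cast<std::chrono::microseconds>(b - a).count()
              << " us\n";
};

auto t0 = clk::now();
hr = cudaMalloc(&m_device.originalImage, size);
hr = cudaMalloc(&m_device.modifiedImage, size);
hr = cudaMalloc(&m_device.tempImage,     size);
auto t1 = clk::now();
hr = cudaMemset(m_device.modifiedImage, 0, size);
cudaDeviceSynchronize();   // cudaMemset may return before the device finishes
auto t2 = clk::now();
hr = cudaMemcpy(m_device.originalImage, host.originalImage, size,
                cudaMemcpyHostToDevice);  // synchronous for pageable host memory
auto t3 = clk::now();

report("cudaMalloc x3", t0, t1);
report("cudaMemset   ", t1, t2);
report("cudaMemcpy   ", t2, t3);
```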

And here is the result of executing the program.

C:\cpu_gpu_filters(GPU)\x64\Release>cpu_gpu_filters test-case.txt
C:\Users\Max\Desktop\test_set\cheshire_cat_1280x720.jpg
Init time: 519 ms
Time spent: 2.35542 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_1366x768.jpg
Init time: 31 ms
Time spent: 2.68595 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_1600x900.jpg
Init time: 44 ms
Time spent: 3.54835 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_1920x1080.jpg
Init time: 61 ms
Time spent: 4.98131 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_2560x1440.jpg
Init time: 107 ms
Time spent: 9.0727 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_3840x2160.jpg
Init time: 355 ms
Time spent: 20.1453 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_5120x2880.jpg
Init time: 449 ms
Time spent: 35.815 ms
C:\Users\Max\Desktop\test_set\cheshire_cat_7680x4320.jpg
Init time: 908 ms
Time spent: 75.4647 ms

UPD: Code with the time measurement:

start = high_resolution_clock::now();
Initialize();
stop = high_resolution_clock::now();
long long ms = duration_cast<milliseconds>(stop - start).count();
long long us = duration_cast<microseconds>(stop - start).count();
cout << "Init time: " << ms << " ms" << endl;


start = high_resolution_clock::now();
GpuTimer gpuTimer;
gpuTimer.Start();
RunGaussianBlurKernel(
    m_device.modifiedImage,
    m_device.tempImage,
    m_device.originalImage, 
    m_device.filter,
    m_filter.width,
    m_host.originalImage.rows, 
    m_host.originalImage.cols
    );
gpuTimer.Stop();

The first image is the smallest, yet its initialization takes 519 ms. Maybe that's the cost of loading the driver or something like that. After that, initialization time grows with image size. That looks logical, but I'm still not sure the initialization should be this slow. Am I doing something wrong?
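The one-off cost on the first image is consistent with CUDA context creation, which happens lazily on the first runtime call. A common trick (a sketch, not a confirmed fix for this case) is to pay that cost once at program start, before any timed section:

```cuda
#include <cuda_runtime.h>

// Any first CUDA runtime API call creates the context. cudaFree(0) is a
// conventional no-op used to force that initialization up front, so later
// per-image "Init time" measurements no longer include it.
cudaFree(0);
```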

Upvotes: 0

Views: 1288

Answers (1)

Florent DUGUET

Reputation: 2916

In your init code, you have a cudaMemset whose execution time depends on size. There is also the cudaMemcpy, whose execution time is roughly the copy size in bytes divided by the PCI-Express bandwidth. It is very likely that this part is responsible for the increase in init time. Running it through NSIGHT will give you more precise figures for each call's execution time. However, without an MCVE it is hard to answer for sure.

Upvotes: 3
