Why a simple CUDA function needs so much local memory?

Question

I've wrote a simple function on CUDA. It's resize an image to double scale. For an image at 1920*1080, this function need ~20ms to complete. I've tried some different way to optimize that function. And I found that may be local memory is the key reason.

I have tried three different method to fetch image.

The Gpu module in OpenCV
Texture bind to GpuMat in OpenCV
Direct fetch GpuMat from global memory

None of them could bring me a little improve.

Then I using the nvvp to find out the reason. And the local memory overhead is ~95% in all three conditions above.

So I turn to my code to find out how nvcc using memory. Then I found that a simple function just like this:

__global__ void performDoubleImage(float* outData, size_t step, const int cols, const int rows)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x >= cols)
        return;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (y >= rows)
        return;
    ((float*)((size_t)outData+y*step))[x] = tex2D(texRef, x/2, y/2);
}

needs 80 bytes stack frame (they're in local memory).

And another function like this:

__global__ void performFinalDoubleImage(const PtrStepSz in, PtrStepSz out)
{
    out(out.rows-1, out.cols-1) = out(in.rows-1, in.cols-1);
}

also needs 88 bytes stack frame.

The question is, why my function using so much local memory and registers in this simple task? And why the function in OpenCV could perform same function by using no local memory (this is test by nvvp, the local memory load is ZERO)?

My code is compiled on debug mode. And my card is GT650(192 SP/SM, 2 SM).

user703016 · Accepted Answer

The two functions you've posted are way too simple to be using that much stack, in fact they shouldn't be using stack at all. The most likely reason that there is so much spilling is that you are compiling with optimizations disabled (for example, in debug mode).

For reference, Robert Crovella compiled your first kernel in release and in debug mode:

Debug:

ptxas info : Function properties for _Z18performDoubleImagePfmii 256 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 23 registers, 296 bytes cumulative stack size, 56 bytes cmem[0], 1 textures

Release:

ptxas info : Function properties for _Z18performDoubleImagePfmii 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 9 registers, 56 bytes cmem[0], 1 textures

Note the difference in stack and register usage. As noted in the comments, when measuring the performance of a program, you should always be compiling for the maximum optimization level, otherwise the measurements will be meaningless.

Why a simple CUDA function needs so much local memory?

Answers (1)

Related Questions