Reputation: 53
My goal is to take advantage of the read-only data cache in my application, and online examples suggest that using __ldg should be relatively straightforward. NVIDIA has GPU optimization documentation (found here: https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf) which provides this simple example:
__global__ void kernel(int *output, int *input)
{
    ...
    output[idx] = __ldg(&input[idx]);
}
However, when I try to compile this, I get the following error message:
error: identifier "__ldg" is undefined.
Searching Google for a solution to this error message has unfortunately been unhelpful. Any suggestions as to what may be wrong with this simple example? Is there a compiler flag that I am missing?
For reference, my device is compute capability 3.5 and I am working with CUDA 5.5.
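In case it is useful, here is a minimal self-contained version of what I am compiling; the index computation and the host-side launch are my own additions filling in the slide's ellipsis:

__global__ void kernel(int *output, int *input)
{
    // Usual 1D global thread index (elided as "..." in the slide).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    output[idx] = __ldg(&input[idx]);   // the line the compiler rejects
}

int main()
{
    // Just enough host code to compile and launch the kernel; the buffers
    // are left uninitialized since this only reproduces the build error.
    const int n = 256;
    int *input, *output;
    cudaMalloc(&input,  n * sizeof(int));
    cudaMalloc(&output, n * sizeof(int));
    kernel<<<1, n>>>(output, input);
    cudaDeviceSynchronize();
    cudaFree(input);
    cudaFree(output);
    return 0;
}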
Thank you.
Upvotes: 4
Views: 6036
Reputation: 21112
For an implementation of __ldg that generalizes to arbitrary types and correctly falls back on compute capabilities below 3.5, see the BryanCatanzaro/generics GitHub project. Here is a bare-bones template:
template<typename T>
__device__ __forceinline__ T ldg(const T* ptr) {
#if __CUDA_ARCH__ >= 350
    // Route the load through the read-only data cache on cc 3.5 and newer.
    return __ldg(ptr);
#else
    // Plain load on older architectures, where __ldg() is not available.
    return *ptr;
#endif
}
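Kernels then call the wrapper instead of __ldg() directly, and the same source builds for both older and newer targets. A minimal usage sketch (the kernel itself is my own illustration, not from the generics project):

// Hypothetical example: the read-only inputs x and y are loaded via ldg().
template<typename T>
__global__ void axpy(T a, const T *x, const T *y, T *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * ldg(&x[i]) + ldg(&y[i]);   // plain loads below cc 3.5
}

Launching it as axpy<<<grid, block>>>(2.0f, x, y, out, n) deduces T as float.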
Upvotes: 7
Reputation: 152143
The __ldg() intrinsic is only available on compute capability 3.5 (or newer) architectures, and the code must also be compiled for such a target. That means a build like this produces the "identifier __ldg is undefined" error:
nvcc -arch=sm_30 ...
whereas compiling for a cc 3.5 target works:
nvcc -arch=sm_35 ...
If you build for multiple architectures, e.g.:
nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 ...
then every call to __ldg() must be guarded with __CUDA_ARCH__ (as in the other answer), because the sm_30 compilation pass still does not know the intrinsic.
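If it is unclear which architecture the device pass is actually targeting, a preprocessor check like the following (my own sketch, not required for the fix) turns the problem into an explicit error message:

// Fails any device compilation pass that targets less than cc 3.5,
// which is exactly the situation in which __ldg() is undefined.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 350
#error "__ldg() requires compiling for compute capability 3.5 or newer (e.g. -arch=sm_35)"
#endif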
Upvotes: 10