user3280204

Reputation: 53

Error using __ldg in cuda kernel at compile time

My goal is to take advantage of the read-only cache in my application, and online examples suggest that using __ldg should be relatively straightforward.

NVIDIA has documentation for GPU optimization (found here: https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf) which provides the straightforward example:

__global__ void kernel ( int *output, int *input)
{
  ...
  output[idx] = __ldg( &input[idx] );
}

However when I try to compile this I get the following error message:

error: identifier "__ldg" is undefined.  

Searching Google for a solution to this error message has been unfortunately unhelpful. Any suggestions what may be wrong with this simple example?
Is there a compiler flag that I am missing?

For reference my device is compute capability 3.5 and I am working with CUDA 5.5.

Thank you.

Upvotes: 4

Views: 6036

Answers (2)

Kipton Barros

Reputation: 21112

For an implementation of __ldg that generalizes to arbitrary types and correctly falls back on devices of compute capability less than 3.5, see the BryanCatanzaro/generics GitHub project.

Here is a bare bones template:

template<typename T>
__device__ __forceinline__ T ldg(const T* ptr) {
#if __CUDA_ARCH__ >= 350
    return __ldg(ptr);  // read-only data cache load, available on sm_35+
#else
    return *ptr;        // plain load on older architectures
#endif
}

Upvotes: 7

Robert Crovella

Reputation: 152143

The __ldg() intrinsic is only available on compute capability 3.5 (or newer) architecture.

That means:

  1. It must be run on a compute 3.5 (or newer) GPU
  2. It must be compiled for a compute 3.5 (or newer) GPU
  3. It cannot also be compiled for an older architecture.

That means:

  1. This won't work: nvcc -arch=sm_30 ...
  2. This will work: nvcc -arch=sm_35 ...
  3. This won't work: nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 ...
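A caveat on point 3: the multi-architecture command fails only because the bare call to __ldg is compiled for sm_30 as well. Wrapping the call in an `#if __CUDA_ARCH__ >= 350` guard (as in the other answer) makes a single source file compile for both targets. A minimal sketch, with a hypothetical kernel name:

```cuda
// guarded.cu -- compiles with:
// nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 guarded.cu
__global__ void guarded_copy(int *output, const int *input)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
#if __CUDA_ARCH__ >= 350
    output[idx] = __ldg(&input[idx]);  // read-only cache load on sm_35+
#else
    output[idx] = input[idx];          // plain load on the sm_30 pass
#endif
}
```

Each `-gencode` pass preprocesses the file with its own `__CUDA_ARCH__` value, so the sm_30 pass never sees the __ldg token.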

Upvotes: 10
