Reputation: 53
My goal is to take advantage of the read-only data cache in my application, and online examples suggest that using __ldg should be relatively straightforward. NVIDIA has GPU optimization documentation (found here: https://www.olcf.ornl.gov/wp-content/uploads/2013/02/GPU_Opt_Fund-CW1.pdf) which provides this simple example:
__global__ void kernel(int *output, int *input)
{
    ...
    output[idx] = __ldg(&input[idx]);
}
However, when I try to compile this, I get the following error message:
error: identifier "__ldg" is undefined.
Searching Google for a solution to this error message has unfortunately been unhelpful. Any suggestions as to what may be wrong with this simple example? Is there a compiler flag that I am missing?
For reference, my device is compute capability 3.5 and I am working with CUDA 5.5.
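In case it is useful, here is a minimal self-contained version of what I am compiling; the index computation and the host-side launch are my own additions filling in the slide's ellipsis:

__global__ void kernel(int *output, int *input)
{
    // Usual 1D global thread index (elided as "..." in the slide).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    output[idx] = __ldg(&input[idx]);   // the line the compiler rejects
}

int main()
{
    // Just enough host code to compile and launch the kernel; the buffers
    // are left uninitialized since this only reproduces the build error.
    const int n = 256;
    int *input, *output;
    cudaMalloc(&input,  n * sizeof(int));
    cudaMalloc(&output, n * sizeof(int));
    kernel<<<1, n>>>(output, input);
    cudaDeviceSynchronize();
    cudaFree(input);
    cudaFree(output);
    return 0;
}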
Thank you.
Upvotes: 4
Views: 6036
Reputation: 21112
For an implementation of __ldg that generalizes to arbitrary types and correctly falls back on compute capabilities below 3.5, see the BryanCatanzaro/generics GitHub project. Here is a bare-bones template:
template<typename T>
__device__ __forceinline__ T ldg(const T* ptr) {
#if __CUDA_ARCH__ >= 350
    // Route the load through the read-only data cache on cc 3.5 and newer.
    return __ldg(ptr);
#else
    // Plain load on older architectures, where __ldg() is not available.
    return *ptr;
#endif
}
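Kernels then call the wrapper instead of __ldg() directly, and the same source builds for both older and newer targets. A minimal usage sketch (the kernel itself is my own illustration, not from the generics project):

// Hypothetical example: the read-only inputs x and y are loaded via ldg().
template<typename T>
__global__ void axpy(T a, const T *x, const T *y, T *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * ldg(&x[i]) + ldg(&y[i]);   // plain loads below cc 3.5
}

Launching it as axpy<<<grid, block>>>(2.0f, x, y, out, n) deduces T as float.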
Upvotes: 7
Reputation: 152143
The __ldg() intrinsic is only available on compute capability 3.5 (or newer) architectures, and the code must also be compiled for such a target. That means a build like this produces the "identifier __ldg is undefined" error:
nvcc -arch=sm_30 ...
whereas compiling for a cc 3.5 target works:
nvcc -arch=sm_35 ...
If you build for multiple architectures, e.g.:
nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 ...
then every call to __ldg() must be guarded with __CUDA_ARCH__ (as in the other answer), because the sm_30 compilation pass still does not know the intrinsic.
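If it is unclear which architecture the device pass is actually targeting, a preprocessor check like the following (my own sketch, not required for the fix) turns the problem into an explicit error message:

// Fails any device compilation pass that targets less than cc 3.5,
// which is exactly the situation in which __ldg() is undefined.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 350
#error "__ldg() requires compiling for compute capability 3.5 or newer (e.g. -arch=sm_35)"
#endif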
Upvotes: 10