Chaitanya Turubati
Chaitanya Turubati

Reputation: 53

cuda constant memory reference

I have an array in constant memory (it is a global variable) and obtained the reference to it by function call cudaGetSymbolAddress. My kernel runs slowly when I use this reference to fetch the constant data rather than using the global variable. What is the reason for this?

__constant__ int g[2] = {1,2};
// __device__ int g[2] = {1,2};

// kernel: use by reference
__global__ void add_1( int *a, int *b, int *c, int *f )
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    c[tid] = f[0] * a[tid] + f[1] * b[tid];
}

// kernel: use global variable
__global__ void add_2( int *a, int *b, int *c, int *f )
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    c[tid] = g[0] * a[tid] + f[1] * b[tid];
}

int main()
{
    ......
    // a,b,c are large arrays in device memory of size 40960.

    int *f;
    cudaGetSymbolAddress( (void **)&f, (char *)&g);

    add_1 <<< 160, 256 >>> ( a, b, c, f );

    ......
}

This is the sample code and all threads in warp load same location at same time. The commented code is by directly accessing constant memory

Explanation for why constant memory cache is not used (by talonmies)

The reason is the lack of constant cache. Cached access only occurs when the compiler emits a specific PTX instruction (ld.const) on a variable explicit marked as being in the constant state space. And the way the compiler knows to do this is when a variable is declared __constant__ -- it is a static, compile time attribute which effects code generation. The same process can't happen at runtime.

If you pass a pointer in global memory and the compiler can't determine that the pointer in the constant state space, it won't generate the correct PTX to access that memory via the constant cache. Access will be slower as a result.

Unanswered Question

Why even when array g is declared as __device__ variable, the code is slower when reference to it used. By seeing the PTX code, for loading the global memory to registers:

What is the difference and any documentation reference would be appreciated?

Upvotes: 4

Views: 2523

Answers (1)

tera
tera

Reputation: 7265

Unlike global memory, accesses to constant memory will get serialized (split into multiple transactions) if they are not uniform (all threads of a (half- for compute capability 1.x) warp access the same address.

So only use constant memory if the accesses are likely to be uniform.

Upvotes: 2

Related Questions