user2052436

Reputation: 4765

How randomly accessing small constant arrays from a CUDA kernel works

My kernel uses a float array of size 8 by 8 with the random access pattern shown below.

// inds - big array of indices in range 0,...,7
// flt  - 8 by 8 array of floats

// The kernel essentially processes a large 2D array by looping over the slow
// coordinate and parallelizing the fast coordinate across blocks/threads.

__global__ void kernel (int const* inds, float const* flt, ...)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;  // Global fast coordinate
    int idy;                                          // Global slow coordinate
    int sx = gridDim.x * blockDim.x;                  // Stride of fast coordinate

    for ( idy = 0; idy < 10000; idy++ )       // Slow coordinate loop
    {
        int id = idx + idy * sx;              // Global coordinate in 2D array

        int ind = inds[id];                   // Index of random access to small array

        float f0 = flt[ind * 8 + 0];
        // ... loads of f1 through f6 ...
        float f7 = flt[ind * 8 + 7];

        // Next, some algebraic formulas use f0, ..., f7
    }
}

What would be the best way to access the flt array?

  1. Do not pass flt as an argument; place it in __constant__ memory (see the sketch after this list). I am not sure how fast constant memory is when different threads access different elements.
  2. Use it as above, through global memory. The load-uniform path will not be used because threads access different data. Will it nevertheless be fast because of the cache?
  3. Copy it into shared memory and use the shared memory array (a sketch follows the transposition note below).
  4. Use textures. I have never used textures... Can this approach be fast?
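
For completeness, here is roughly what I mean by option 1. This is only a sketch: flt_const, out and sx are illustrative names, and the real algebra is replaced by a placeholder.

__constant__ float flt_const[64];   // the 8 x 8 table in constant memory

// Host side, before the launch:
//   cudaMemcpyToSymbol(flt_const, host_flt, 64 * sizeof(float));

__global__ void kernel_const (int const* inds, float* out, int sx)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    for ( int idy = 0; idy < 10000; idy++ )
    {
        int id  = idx + idy * sx;
        int ind = inds[id];

        // When ind differs across a warp, the constant cache serializes
        // the reads per distinct address - this is my concern here.
        float f0 = flt_const[ind * 8 + 0];
        float f7 = flt_const[ind * 8 + 7];

        out[id] = f0 + f7;            // Placeholder for the real formulas
    }
}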

For shared memory, it is probably better to transpose the flt array, i.e. to access it this way, to avoid bank conflicts:

float fj = flt_shared[j * 8 + ind]; // where j = 0, ..., 7
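
To make option 3 concrete, a sketch with the transposed layout (illustrative names again; it assumes the block has at least 64 threads):

__global__ void kernel_shared (int const* inds, float const* flt,
                               float* out, int sx)
{
    __shared__ float flt_shared[64];

    // The first 64 threads cooperatively load the table, transposing it so
    // that flt_shared[j * 8 + ind] puts distinct ind values in distinct banks.
    if ( threadIdx.x < 64 )
    {
        int r = threadIdx.x / 8;     // row in the original layout
        int c = threadIdx.x % 8;     // column in the original layout
        flt_shared[c * 8 + r] = flt[r * 8 + c];
    }
    __syncthreads();

    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    for ( int idy = 0; idy < 10000; idy++ )
    {
        int id  = idx + idy * sx;
        int ind = inds[id];

        float f0 = flt_shared[0 * 8 + ind];   // j = 0
        float f7 = flt_shared[7 * 8 + ind];   // j = 7

        out[id] = f0 + f7;            // Placeholder for the real formulas
    }
}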

PS: Target architectures are Fermi and Kepler.

Upvotes: 2

Views: 1706

Answers (1)

Vitality

Reputation: 21475

The "best" way depends also on the architecture you are working on. My personal experience with random access (your access seems to be sort of a random due to the use of the mapping inds[id]) on Fermi and Kepler is that L1 is now so fast that in many cases it is better to keep using global memory instead of shared memory or texture memory.

Accelerating global memory random access: Invalidating the L1 cache line

Fermi and Kepler architectures support two types of loads from global memory. Full caching is the default mode: it attempts to hit in L1, then L2, then GMEM, and the load granularity is a 128-byte line. L2-only attempts to hit in L2, then GMEM, and the load granularity is 32 bytes. For certain random access patterns, memory efficiency can be increased by invalidating L1 and exploiting the lower granularity of L2. This can be done by compiling with the -Xptxas -dlcm=cg option to nvcc, as in the example below.
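
For example, assuming your source file is named kernel.cu (the file name is just illustrative):

nvcc -Xptxas -dlcm=cg kernel.cu -o kernel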

General guidelines for accelerating global memory access: disabling ECC support

Fermi and Kepler GPUs support Error Correcting Code (ECC), and ECC is enabled by default. ECC reduces peak memory bandwidth and is intended to enhance data integrity in applications like medical imaging and large-scale cluster computing. If it is not needed, it can be disabled for improved performance using the nvidia-smi utility on Linux (see the example below), or via the Control Panel on Microsoft Windows systems. Note that toggling ECC on or off requires a reboot to take effect.
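
For example, to disable ECC on GPU 0 from the command line (the device index is illustrative, and the change only takes effect after a reboot):

nvidia-smi -i 0 -e 0    # -e 1 re-enables ECC later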

General guidelines for accelerating global memory access on Kepler: using read-only data cache

Kepler features a 48KB cache for data that is known to be read-only for the duration of the function. Use of the read-only path is beneficial because it offloads the Shared/L1 cache path and it supports full-speed unaligned memory access. Use of the read-only path can be managed automatically by the compiler (use the const __restrict__ qualifiers) or explicitly by the programmer (use the __ldg() intrinsic).
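
As an illustration, here are both routes applied to the kernel in the question (a sketch only; out and sx are illustrative names, and __ldg() requires compute capability 3.5 or above):

// Compiler-managed read-only path: qualify both input pointers so the
// compiler can prove they are read-only for the duration of the kernel.
__global__ void kernel_ro (int const* __restrict__ inds,
                           float const* __restrict__ flt,
                           float* __restrict__ out, int sx)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    for ( int idy = 0; idy < 10000; idy++ )
    {
        int id  = idx + idy * sx;
        int ind = inds[id];

        // Explicit alternative: __ldg() forces the load through the
        // read-only cache regardless of pointer qualifiers (sm_35+).
        float f0 = __ldg(&flt[ind * 8 + 0]);
        float f7 = __ldg(&flt[ind * 8 + 7]);

        out[id] = f0 + f7;    // Placeholder for the actual computation
    }
}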

Upvotes: 1
