user3089852

Reputation: 41

Optimizing a CUDA kernel with irregular memory accesses

I have the following CUDA kernel which seems very "tough" to optimize:

__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap, int n, int filter_size, int ai)
{
    // Grid-stride loop over the output elements
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < filter_size; idx += blockDim.x * gridDim.x)
    {
        // n is a power of two (33554432 = 2^25), so (idx * ai) & (n - 1) is (idx * ai) mod n
        int index = (idx * ai) & (n - 1);
        d_origx_remap[idx] = d_origx[index];
    }
}

// Parameters were defined before
int permute[loops] = {29165143, 3831769, 17603771, 9301169, 32350975, ...};
int n = 33554432;
int filter_size = 1783157;

for (int i = 0; i < loops; i++)
{
    DataLayoutTransformKernel<<<dimGrid, dimBlock, 0, stream[i]>>>((cuDoubleComplex*)d_origx, (cuDoubleComplex*)d_origx_remap + i * filter_size, n, filter_size, permute[i]);
}

The purpose of the kernel is to reorder the data of d_origx[] from an irregular layout into a regular one (d_origx_remap). The kernel is launched several times, each with a different access stride (ai).

The challenge here is the irregular memory access pattern when reading d_origx[index]. My idea was to use shared memory, but in this case it seems very hard to use shared memory to coalesce the global memory accesses.

Does anyone have suggestions on how to optimize this kernel?

Upvotes: 4

Views: 959

Answers (2)

Vitality

Reputation: 21455

I'm not sure you can do much to optimize your code.

There is no thread cooperation at all, so I would say that shared memory is not the way to go.

You may try changing

__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap, int n, int filter_size, int ai)

to

__global__ void DataLayoutTransformKernel(const cuDoubleComplex* __restrict__ d_origx, cuDoubleComplex* __restrict__ d_origx_remap, const int n, const int filter_size, const int ai)

i.e., using the const and __restrict__ keywords. In particular, __restrict__ enables nvcc to perform some optimizations; see Section B.2 of the CUDA C Programming Guide. On the Kepler architecture, pointers qualified with const and __restrict__ may be flagged by the compiler so that loads through them go through the read-only data cache; see the Kepler architecture whitepaper.
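
For illustration, here is a minimal sketch of the same kernel with those qualifiers applied. On devices of compute capability 3.5 or higher you can also request the read-only path explicitly with the __ldg() intrinsic (shown in a comment), although with const and __restrict__ the compiler will usually generate those loads on its own.

#include <cuComplex.h>

// Sketch only: identical kernel body, with const/__restrict__ so the compiler
// can route the reads of d_origx through the read-only data cache on Kepler.
__global__ void DataLayoutTransformKernel(const cuDoubleComplex* __restrict__ d_origx,
                                          cuDoubleComplex* __restrict__ d_origx_remap,
                                          const int n, const int filter_size, const int ai)
{
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < filter_size; idx += blockDim.x * gridDim.x)
    {
        int index = (idx * ai) & (n - 1);
        d_origx_remap[idx] = d_origx[index];
        // Alternatively, on compute capability >= 3.5, load explicitly through the
        // read-only cache (cuDoubleComplex is a typedef for double2):
        // d_origx_remap[idx] = __ldg(&d_origx[index]);
    }
}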

Upvotes: 1

Maddy Scientist

Reputation: 129

Trove is a CUDA/C++ library with support for array-of-structures (AoS) access, and it likely gives close to optimal performance for random AoS access. From the GitHub page it looks like Trove gets about a 2x speedup over the naive approach for 16-byte structures.

https://github.com/BryanCatanzaro/trove

(Figure: random access performance using Trove compared to the naive direct access approach)
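
As a rough sketch (assuming the trove::coalesced_ptr wrapper from trove/ptr.h shown in the repository README), the kernel could be adapted along these lines; since cuDoubleComplex is 16 bytes, it falls in the structure-size range Trove targets. Note that Trove assumes the accessing warp is converged, so the tail of the grid-stride loop may need separate handling.

#include <cuComplex.h>
#include <trove/ptr.h>   // from https://github.com/BryanCatanzaro/trove

// Sketch only: wrap the global pointers in trove::coalesced_ptr so the gather
// from d_origx goes through Trove's coalescing machinery.
__global__ void DataLayoutTransformKernelTrove(const cuDoubleComplex* d_origx,
                                               cuDoubleComplex* d_origx_remap,
                                               int n, int filter_size, int ai)
{
    trove::coalesced_ptr<const cuDoubleComplex> src(d_origx);
    trove::coalesced_ptr<cuDoubleComplex> dst(d_origx_remap);
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < filter_size; idx += blockDim.x * gridDim.x)
    {
        int index = (idx * ai) & (n - 1);
        dst[idx] = src[index];
    }
}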

Upvotes: 5
