Reputation: 41
I have the following CUDA kernel, which seems very tough to optimize:
__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap, int n, int filter_size, int ai)
{
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < filter_size; idx += blockDim.x * gridDim.x)
    {
        int index = (idx * ai) & (n - 1);
        d_origx_remap[idx] = d_origx[index];
    }
}
// Parameters were defined before
int permute[loops] = {29165143, 3831769, 17603771, 9301169, 32350975, ...};
int n = 33554432;
int filter_size = 1783157;

for (int i = 0; i < loops; i++)
{
    DataLayoutTransformKernel<<<dimGrid, dimBlock, 0, stream[i]>>>((cuDoubleComplex*)d_origx, (cuDoubleComplex*)d_origx_remap + i*filter_size, n, filter_size, permute[i]);
}
The purpose of the kernel is to reorder the data layout of d_origx[] from irregular to regular (into d_origx_remap). The kernel is launched several times with different access strides (ai).
The challenge here is the irregular memory access pattern when reading d_origx[index]. My idea was to use shared memory, but in this case it seems very hard to use shared memory to coalesce the global memory accesses.
Does anyone have suggestions on how to optimize this kernel?
Upvotes: 4
Views: 959
Reputation: 21455
I'm not sure you can do much to optimize your code.
There is no thread cooperation at all, so I would say that shared memory is not the way to go.
You may try changing
__global__ void DataLayoutTransformKernel(cuDoubleComplex* d_origx, cuDoubleComplex* d_origx_remap, int n, int filter_size, int ai)
to
__global__ void DataLayoutTransformKernel(const cuDoubleComplex* __restrict__ d_origx, cuDoubleComplex* __restrict__ d_origx_remap, const int n, const int filter_size, const int ai)
i.e., using the const and __restrict__ keywords. In particular, __restrict__ enables nvcc to perform some optimizations; see Section B.2 of the CUDA C Programming Guide. On the Kepler architecture, pointers qualified with const and __restrict__ may additionally be loaded by the compiler through the Read-Only Data Cache; see the Kepler architecture whitepaper.
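On compute capability 3.5 and later you can also request the read-only path explicitly with __ldg(). A sketch of the kernel with both changes (the __ldg() call is my addition to your original code; __ldg has an overload for double2, of which cuDoubleComplex is a typedef):

```cuda
#include <cuComplex.h>

// Same kernel with const/__restrict__ qualifiers plus an explicit
// __ldg() load, which routes the read through the Read-Only Data
// Cache on compute capability 3.5 and newer.
__global__ void DataLayoutTransformKernel(const cuDoubleComplex* __restrict__ d_origx,
                                          cuDoubleComplex* __restrict__ d_origx_remap,
                                          const int n, const int filter_size, const int ai)
{
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < filter_size;
         idx += blockDim.x * gridDim.x)
    {
        int index = (idx * ai) & (n - 1);
#if __CUDA_ARCH__ >= 350
        d_origx_remap[idx] = __ldg(&d_origx[index]);   // read-only cached load
#else
        d_origx_remap[idx] = d_origx[index];
#endif
    }
}
```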
Upvotes: 1
Reputation: 129
The Trove library is a CUDA/C++ library for array-of-structures (AoS) access, and likely gives close to optimal performance for random AoS access. From its GitHub page, trove achieves roughly a 2x speedup over the naive approach for 16-byte structures.
https://github.com/BryanCatanzaro/trove
Upvotes: 5