user3452579

Reputation: 443

Should I use pitched memory in CUDA for read-only 2D arrays?

I am porting some code from CPU to GPU. On the CPU side I have a dynamically allocated matrix (double **) which needs to be ported to the GPU; however, once initialized, the matrix is never modified. Since I can't use pointers to pointers on the GPU, should I represent this matrix as a flat array (double *, accessed as matrix[i * nCols + j]) or use pitched memory for it? Would pitched memory lead to a performance improvement in this case?
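Roughly, the two options would look like this (a minimal sketch; nRows and nCols stand in for the actual dimensions):

// Option A: flat linear allocation, indexed as matrix[i * nCols + j]
double *d_flat;
cudaMalloc((void **)&d_flat, nRows * nCols * sizeof(double));

// Option B: pitched allocation; each row starts at a pitch-aligned offset
double *d_pitched;
size_t pitch;  // row stride in bytes, chosen by the runtime
cudaMallocPitch((void **)&d_pitched, &pitch, nCols * sizeof(double), nRows);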

Upvotes: 1

Views: 758

Answers (1)

Christian Sarofeen

Reputation: 2250

The only case I can think of where pitched memory could perform worse than linear memory for a 2D array is if you access it directly with a flat index:

// flat indexing ignores the padding that pitch inserts at the end of each row
int tid = blockIdx.x * blockDim.x + threadIdx.x;
double myVal = _d_array[tid];

Otherwise, pitch will at least align the first entry of each row. A read through the coalesced-access section of the CUDA Best Practices Guide (http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#coalesced-access-to-global-memory) will definitely help your understanding. If your rows are small (~16 entries), or you're using a compute capability 2.x card, you could see significant performance improvements when you access the data row by row with pitch instead of a linear layout.
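For illustration, row-by-row access with a pitched allocation looks something like this (a rough sketch, not from your code; the kernel name, d_in/d_out, and the scaling are just placeholders):

// Each thread handles one (row, col) element. The row base address is
// computed from the pitch (in bytes), so every row starts aligned.
__global__ void scaleMatrix(const double *d_in, size_t pitch,
                            int nRows, int nCols, double *d_out)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < nRows && col < nCols) {
        // Step to the start of this row using the pitch in bytes.
        const double *rowPtr =
            (const double *)((const char *)d_in + (size_t)row * pitch);
        d_out[row * nCols + col] = 2.0 * rowPtr[col];  // read-only input
    }
}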

The worst case without pitch for row-by-row access on a 2.x card could be close to 50% of the achievable bandwidth for an unaligned read of 16 double values. It can also thrash your L1 cache pretty badly, since each unaligned read pulls in an extra L1 cache line.

On 3.x, global loads are not cached in L1, so an unaligned read of 16 doubles results in a 5 x 32B fetch into L2 instead of 4 x 32B, and the performance hit will likely be small.

One thing to keep in mind: making block sizes a multiple of 32 is typically a good idea.
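As a concrete (purely illustrative) example of that last point, continuing the sketch above, a launch configuration along these lines keeps whole warps busy and lets blockDim.x cover consecutive columns for coalescing:

// 32x8 = 256 threads per block: a multiple of the 32-thread warp size.
dim3 block(32, 8);
dim3 grid((nCols + block.x - 1) / block.x,
          (nRows + block.y - 1) / block.y);
scaleMatrix<<<grid, block>>>(d_in, pitch, nRows, nCols, d_out);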

Upvotes: 1
