user1760748

Reputation: 149

CUDA: how to efficiently represent 2-D arrays on the GPU

I need to process a 2-D array with dimensions K x N on the GPU, where K is a small number (3, 4, or 5) and N ranges from millions to hundreds of millions. The processing will be done one column of K elements at a time, such that each column will be processed by a separate invocation of a kernel. What is the most efficient way to represent the K x N array on the GPU:

1) in a 1-D array, placing the K elements of a column in consecutive locations, so that each thread will process elements K*thread_id, K*thread_id + 1, ..., K*thread_id + K - 1;

2) as K separate 1-D arrays, where each array stores 1 row of the original array;

3) something else

Thank you!

Upvotes: 1

Views: 549

Answers (1)

kangshiyin

Reputation: 9781

Option 2 is better for your case.

The data layout of your option 2 can be seen as a structure of arrays (SoA), while option 1 is an array of structures (AoS).
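
To make the two layouts concrete, here is a minimal sketch of the index arithmetic. It assumes that option 2 is stored as one flat buffer with the K rows placed back to back (K separate allocations behave the same way per row). The helper names `aos_index` and `soa_index` are made up for illustration and are not part of any CUDA API.

```cuda
#include <cstddef>

// Hypothetical helpers for illustration only.
// Both return the flat offset of element (row k, column n) of a K x N array.

// Option 1 (AoS): the K elements of column n sit next to each other.
__host__ __device__ inline size_t aos_index(size_t k, size_t n, size_t K) {
    return n * K + k;
}

// Option 2 (SoA): row k is one contiguous run of N elements.
__host__ __device__ inline size_t soa_index(size_t k, size_t n, size_t N) {
    return k * N + n;
}
```

For a fixed k, the threads handling columns n and n + 1 touch offsets that are K elements apart in the AoS layout and 1 element apart in the SoA layout.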

Generally, SoA performs better than AoS in GPU programming. There is a lot of discussion on this topic showing why, for example:

http://developer.download.nvidia.com/CUDA/training/introductiontothrust.pdf

http://my.safaribooksonline.com/book/-/9780123884268/chapter-6dot-efficiently-using-gpu-memory/st0045_b9780123884268000069

Since each thread accesses its K elements one after another, the AoS layout in your option 1 leads to strided memory access, which can hurt performance, as discussed here:

https://developer.nvidia.com/content/how-access-global-memory-efficiently-cuda-cc-kernels
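
The difference in access pattern can be sketched with two kernels that do the same per-column work on each layout. This is only an illustration under assumptions: the data is `float`, the per-column operation is a plain sum, and the names `column_sum_aos`, `column_sum_soa`, and `layouts_agree` are hypothetical. It checks that both layouts give the same result and does not measure speed.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Option 1 (AoS): thread n reads in[n*K + k]. For a fixed k, neighbouring
// threads in a warp read addresses K elements apart (strided access).
__global__ void column_sum_aos(const float* in, float* out, int K, int N) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    float s = 0.0f;
    for (int k = 0; k < K; ++k)
        s += in[(size_t)n * K + k];
    out[n] = s;
}

// Option 2 (SoA): thread n reads in[k*N + n]. For a fixed k, neighbouring
// threads read neighbouring addresses (unit stride).
__global__ void column_sum_soa(const float* in, float* out, int K, int N) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= N) return;
    float s = 0.0f;
    for (int k = 0; k < K; ++k)
        s += in[(size_t)k * N + n];
    out[n] = s;
}

// Hypothetical host-side check (assumes K > 0 and N > 0): fills both layouts
// with the same logical data, runs both kernels, verifies every column sum.
bool layouts_agree(int K, int N) {
    size_t count = (size_t)K * N;
    std::vector<float> h_aos(count), h_soa(count), h_out(N);
    for (int n = 0; n < N; ++n) {
        for (int k = 0; k < K; ++k) {
            float v = float(k + 1);              // element (k, n) holds k + 1
            h_aos[(size_t)n * K + k] = v;
            h_soa[(size_t)k * N + n] = v;
        }
    }

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, count * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, N * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    const float expected = 0.5f * K * (K + 1);   // 1 + 2 + ... + K
    const float* layouts[2] = {h_aos.data(), h_soa.data()};
    bool ok = true;

    for (int pass = 0; pass < 2; ++pass) {
        cudaMemcpy(d_in, layouts[pass], count * sizeof(float),
                   cudaMemcpyHostToDevice);
        if (pass == 0)
            column_sum_aos<<<blocks, threads>>>(d_in, d_out, K, N);
        else
            column_sum_soa<<<blocks, threads>>>(d_in, d_out, K, N);
        // The blocking copy back also waits for the kernel to finish.
        if (cudaMemcpy(h_out.data(), d_out, N * sizeof(float),
                       cudaMemcpyDeviceToHost) != cudaSuccess)
            ok = false;
        for (int n = 0; n < N; ++n)
            if (h_out[n] != expected) ok = false;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

With 4-byte floats and K = 3, the 32 threads of a warp reading a given k cover a 384-byte span in the AoS kernel but a single 128-byte span in the SoA kernel. Timing both kernels on your own hardware (for example with `cudaEvent` timers or a profiler) is the reliable way to see how much this matters for your K and N.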

Although a large enough L2 cache could mitigate this issue in your case, avoiding AoS is a more robust way to get good performance.

Upvotes: 2
