Reputation: 149
I need to process a 2-D array with dimensions K x N on the GPU, where K is small (3, 4, or 5) and N ranges from millions to hundreds of millions. The array is processed one column of K elements at a time, with each column handled by a separate thread of the kernel. What is the most efficient way to represent the K x N array on the GPU:
1) in a 1-D array, placing the K elements of a column in consecutive locations, so that each thread processes elements K*thread_id, K*thread_id + 1, ..., K*thread_id + K - 1;
2) as K separate 1-D arrays, where each array stores one row of the original array;
3) something else?
Thank you!
Upvotes: 1
Views: 549
Reputation: 9781
Option 2 is better for your case.
The data layout of your option 2 is a structure of arrays (SoA), while option 1 is an array of structures (AoS).
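To make the two layouts concrete, here is a minimal sketch of how the indexing differs inside a kernel. Everything in it is an assumption for illustration: the element type (`float`), the per-column operation (a plain sum), the names `sum_columns_aos`, `sum_columns_soa` and `check_layouts`, and the choice to store the K rows of the SoA layout back to back in one allocation rather than in K separate ones.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Option 1 (AoS): column c occupies aos[K*c] .. aos[K*c + K - 1].
template <int K>
__global__ void sum_columns_aos(const float* aos, float* out, size_t n)
{
    size_t c = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (c >= n) return;
    float acc = 0.0f;
    for (int j = 0; j < K; ++j)
        acc += aos[K * c + j];   // neighbouring threads are K floats apart
    out[c] = acc;
}

// Option 2 (SoA): row j is contiguous and starts at soa + j*n
// (assumption: the K rows share one allocation).
template <int K>
__global__ void sum_columns_soa(const float* soa, float* out, size_t n)
{
    size_t c = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (c >= n) return;
    float acc = 0.0f;
    for (int j = 0; j < K; ++j)
        acc += soa[j * n + c];   // neighbouring threads read neighbouring floats
    out[c] = acc;
}

// Hypothetical host helper: fills both layouts with the same logical data,
// runs both kernels and checks the results. CUDA error checking is omitted
// for brevity.
template <int K>
bool check_layouts(size_t n)
{
    std::vector<float> aos(K * n), soa(K * n), out_a(n), out_s(n);
    for (size_t c = 0; c < n; ++c)
        for (int j = 0; j < K; ++j) {
            float v = float(c % 7 + j);   // element (row j, column c)
            aos[K * c + j] = v;
            soa[j * n + c] = v;
        }

    float *d_aos, *d_soa, *d_out;
    cudaMalloc(&d_aos, K * n * sizeof(float));
    cudaMalloc(&d_soa, K * n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_aos, aos.data(), K * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_soa, soa.data(), K * n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = int((n + threads - 1) / threads);
    sum_columns_aos<K><<<blocks, threads>>>(d_aos, d_out, n);
    cudaMemcpy(out_a.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    sum_columns_soa<K><<<blocks, threads>>>(d_soa, d_out, n);
    cudaMemcpy(out_s.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_aos); cudaFree(d_soa); cudaFree(d_out);

    for (size_t c = 0; c < n; ++c) {
        float expected = float(K * (c % 7) + K * (K - 1) / 2);
        if (out_a[c] != expected || out_s[c] != expected) return false;
    }
    return true;
}
```

Both kernels compute the same result; the only difference is the index expression inside the loop, which decides whether the threads of a warp touch adjacent or scattered addresses.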
Generally, SoA is better than AoS for GPU programming. There is a lot of discussion on this topic showing why SoA performs better, for example:
http://developer.download.nvidia.com/CUDA/training/introductiontothrust.pdf
Since each thread accesses its K elements one by one, the AoS layout in option 1 leads to strided memory access, which can hurt performance, as discussed here:
https://developer.nvidia.com/content/how-access-global-memory-efficiently-cuda-cc-kernels
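A rough way to see the cost of the stride is to count how many memory segments one warp touches for a single load instruction. The sketch below is host-only arithmetic, not a measurement: it assumes 4-byte elements, 128-byte segments, a 128-byte aligned base address, and a warp made of threads 0..31. The function name `segments_per_warp_load` is made up for this illustration.

```cuda
#include <cstddef>
#include <set>

// Counts the distinct 128-byte segments touched when threads 0..31 of a warp
// each load element j of their own column.
//   aos == true : element index is K*t + j      (option 1)
//   aos == false: element index is j*n + t      (option 2, rows back to back)
int segments_per_warp_load(int K, int j, bool aos, std::size_t n)
{
    const std::size_t elem_bytes = 4;      // assumption: float elements
    const std::size_t segment_bytes = 128; // assumption: 128-byte segments
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < 32; ++t) {
        std::size_t index = aos ? (std::size_t)K * t + j
                                : (std::size_t)j * n + t;
        segments.insert(index * elem_bytes / segment_bytes);
    }
    return (int)segments.size();
}
```

Under these assumptions the AoS layout touches K segments per load (3, 4, or 5 for the K values in the question), while the SoA layout touches one when n is a multiple of 32. The extra bytes fetched in the AoS case are the ones the same threads need for the other values of j, so they are only wasted if they are evicted from cache before being used.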
Although a large enough L2 cache could mitigate this issue in your case, avoiding AoS is a more robust way to get higher performance.
Upvotes: 2