Reputation: 8608
I want to fill the global memory with as much data as possible, and I know each value I have is between 0 and 255. So I was thinking that instead of using an int type I could store the values in a short, or even better in a char, and access and compute on them on the device using the same type.
However, will this affect performance on a Tesla architecture? And will the copy from global to shared memory be coalesced?
Any ideas? Thanks!
Upvotes: 2
Views: 392
Reputation: 72344
The best strategy for optimising bandwidth utilization for 8 or 16 bit types will depend on the memory access patterns within a warp. If any given warp will read in a highly scattered fashion, then using byte sized types and forcing 32 byte transaction sizes per warp might be an advantage, because the standard transaction sizes will waste a lot of the achieved bandwidth. But if the access patterns within a warp are going to hit contiguous or near-contiguous segments of memory, then the best strategy will probably be to use uchar4 or ushort2 and aim to achieve throughput using coalesced 32 bit reads per thread from global memory to a shared memory array, and have all the threads in a block read from that. It might also be worth evaluating the performance of a texture for loads if the per warp accesses contain scatter.
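To make the contiguous case concrete, here is a minimal sketch of the uchar4 approach: each thread issues one 32 bit load from global memory into a shared memory tile, and the block then works on the bytes from there. The names sum_bytes, sum_on_device and THREADS are made up for illustration, the summation is only a stand-in for real work, and the host code assumes a non-empty input whose length is a multiple of 4. Error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size for this sketch; the kernel must be launched
// with exactly this many threads per block.
constexpr int THREADS = 256;

// Each thread loads one uchar4 (four 8 bit values packed into 32 bits), so
// the threads of a warp read consecutive 32 bit words from global memory.
__global__ void sum_bytes(const uchar4 *in, unsigned int *out, int n4)
{
    __shared__ uchar4 tile[THREADS];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Pad the last tile with zeros so out-of-range threads add nothing.
    tile[threadIdx.x] = (i < n4) ? in[i] : make_uchar4(0, 0, 0, 0);
    __syncthreads();

    // Any thread in the block can now read any byte of the tile.
    // As a placeholder for real work, thread 0 sums the whole tile.
    if (threadIdx.x == 0) {
        unsigned int s = 0;
        for (int k = 0; k < THREADS; ++k) {
            uchar4 v = tile[k];
            s += v.x + v.y + v.z + v.w;
        }
        atomicAdd(out, s);
    }
}

// Host helper (assumption: host.size() is non-zero and a multiple of 4).
unsigned int sum_on_device(const std::vector<unsigned char> &host)
{
    int n4 = static_cast<int>(host.size() / 4);
    size_t bytes = static_cast<size_t>(n4) * sizeof(uchar4);

    uchar4 *d_in = nullptr;
    unsigned int *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, sizeof(unsigned int));
    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(unsigned int));

    int blocks = (n4 + THREADS - 1) / THREADS;
    sum_bytes<<<blocks, THREADS>>>(d_in, d_out, n4);

    unsigned int result = 0;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(result), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling sum_on_device on a std::vector<unsigned char> of 0 to 255 values packs four of them per 32 bit word on the device, which is a quarter of the footprint of storing each one as an int.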
This pdf contains a lot of useful information about memory performance optimisation on Fermi. I would encourage you to spend some time reading it. Benchmarking is really the only way of evaluating the best approach for a given application, and that document has some excellent tips on how to understand the memory performance of a given piece of code.
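As a starting point for that kind of benchmarking, the following is a rough sketch (not taken from the linked document) that times a plain uchar4 copy kernel with CUDA events and reports effective bandwidth. The names copy_uchar4 and copy_bandwidth_gbs are hypothetical, the single timed launch is a simplification, and error checking is omitted; for real measurements you would average several runs and swap in your own kernel.

```cuda
#include <cuda_runtime.h>

// Trivial copy kernel: one 32 bit (uchar4) load and store per thread.
__global__ void copy_uchar4(const uchar4 *in, uchar4 *out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];
}

// Returns effective bandwidth in GB/s for copying n4 uchar4 elements,
// counting both the bytes read and the bytes written.
float copy_bandwidth_gbs(int n4, int threads)
{
    size_t bytes = static_cast<size_t>(n4) * sizeof(uchar4);
    uchar4 *d_in = nullptr;
    uchar4 *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 1, bytes);

    int blocks = (n4 + threads - 1) / threads;

    // Warm-up launch so one-time startup costs are not included in the timing.
    copy_uchar4<<<blocks, threads>>>(d_in, d_out, n4);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy_uchar4<<<blocks, threads>>>(d_in, d_out, n4);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);

    float gigabytes = 2.0f * static_cast<float>(bytes) / 1.0e9f;
    return gigabytes / (ms / 1.0e3f);
}
```

Running the same harness with an unsigned char or int version of the kernel gives a direct comparison of how the element type affects achieved bandwidth on your particular card.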
Upvotes: 1