Joseph Lenox

Reputation: 45

CUDA Coalescing performance of small data types (Fermi, Kepler)?

While looking through a copy of The CUDA Handbook by Nicholas Wilt, I noticed that apparently 1-byte and 2-byte memory transactions are not coalesced. However, it was my understanding that Fermi and Kepler (SM2+) architectures fetch the number of cache lines required to satisfy a warp's memory requests. To me, that sounds like coalescing.

My application, to save space, was making heavy use of 1 and 2-byte data fields (in large 2D pitch-linear arrays) and hammering global memory.

I went ahead and made the changes to my application to have a thread fetch 4 entries at once by simply unioning an unsigned integer with four unsigned chars and fetching the union.
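Roughly speaking, the change looked like the following sketch (the kernel, names, and placeholder work are hypothetical; my real code works on 2D pitch-linear arrays):

```cuda
// Sketch of the 4-bytes-per-thread load: each thread fetches one 32-bit word
// and then operates on its four packed bytes.
union Packed4 {
    unsigned int  word;
    unsigned char bytes[4];
};

__global__ void processBytes(const unsigned int *in, unsigned int *out, int nWords)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nWords) return;

    Packed4 p;
    p.word = in[i];                  // single coalesced 4-byte load per thread

    #pragma unroll
    for (int k = 0; k < 4; ++k)
        p.bytes[k] = p.bytes[k] + 1; // placeholder per-byte work

    out[i] = p.word;                 // single coalesced 4-byte store per thread
}
```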

Running it on some of my test data, I'm seeing an improvement of ~32% on a Kepler laptop card (750M, SM3.5) and ~45% on a Tesla C2075 (SM2.0).

What's the more likely reason for this improvement: increased per-thread work, fewer overall memory fetch requests, or was my understanding of how coalescing works for small data types incorrect?

Upvotes: 1

Views: 591

Answers (1)

harrism

Reputation: 27809

Your understanding of coalescing on Fermi and Kepler is more or less correct. They fetch the number of cache lines required to satisfy all load requests in the warp.

First, given your speedup numbers, I conclude that your test is bandwidth bound.

If threads in a warp are loading contiguous bytes, that is 32 bytes per warp. The cache line size is 128 bytes, which means each warp is only utilizing 25% of the bandwidth it could get. But it also means each warp's loads should be reused by 3 other warps, assuming they are not evicted from the cache first. That is neither here nor there, though, because if you have enough threads and the loads are fully coalesced, you can probably hide most of the latency even without the cache.

By fetching 4 bytes per thread instead, you get lower cache reuse, but you utilize more of the available bandwidth, which is likely why you see a speedup.

Since your test is bandwidth bound, you may get even more speedup by loading a uint2 or uint4 per thread (8 or 16 bytes). The reason is that typically you need more than one cache line request per warp in flight in order to fully saturate the memory bandwidth. So I would try that too.
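A minimal sketch of the 16-bytes-per-thread variant (the kernel name and the placeholder arithmetic are hypothetical, just to illustrate the vectorized load):

```cuda
// Each thread loads one uint4 (16 bytes), i.e. 16 of the original 1-byte
// elements, so a warp issues 512 bytes = 4 cache-line requests at once.
__global__ void processBytes16(const uint4 *in, uint4 *out, int nVec)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nVec) return;

    uint4 v = in[i];        // one coalesced 16-byte load per thread

    // Placeholder work on the packed words.
    v.x ^= 0x01010101u;
    v.y ^= 0x01010101u;
    v.z ^= 0x01010101u;
    v.w ^= 0x01010101u;

    out[i] = v;             // one coalesced 16-byte store per thread
}
```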

Upvotes: 2
