Reputation: 31
I understand the bank conflict when dealing with 4-byte data types, but I wonder if we get any bank conflict (4-way/8-way?) with the following code
__shared__ char shared[];
foo = shared[threadIdx.x];
The above code leads to 4 consecutive threads in a warp accessing the same word address in the same bank.
Will a similar memory access pattern lead to a bank conflict for any CUDA device family? Apparently, it does so for old cards only but I want to confirm.
My question can be generalized further: What if a number of threads access the same bank addressable unit [8-byte or 4-byte] but each requires a fraction of it. will the hardware handle such requests without any bank conflict? Thank you
Upvotes: 3
Views: 430
Reputation: 151869
All cc2.0 and newer GPU devices have a broadcast mechanism such that for any number of threads participating in a warp request which are accessing a given 32-bit aligned location or any portion of that location (or multiple groups of such threads, each group accessing a given 32-bit aligned location or any portion of that location), the threads in that group will be serviced in a single transaction, without serialization.
From the documentation:
A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads (multiple words can be broadcast in a single transaction) and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).
For the devices that support 8-byte-bank-mode, and which are in 8-byte-bank-mode, the above broadcast mechanism generalizes to a 64-bit aligned location.
Note that I've chosen my wording here carefully. Suppose I have two such broadcast groups in a single warp request. Now also suppose that those groups are addressing two different locations, but two locations in the same bank. E.g. group A is targetting address 0, and group B is targetting address 1024. In this case, all threads participating in group A will be serviced in a single transaction, and all threads participating in group B will be serviced in a single transaction, but those two groups will be serialized with respect to each other.
On the other hand, if group targetted address 0 and group B targetted address 8, those are in separate banks, so all of the threads in group A and all of the threads in group B could be serviced in a single transaction, because multiple broadcasts per request are permitted.
Upvotes: 2