Reputation: 8618
I have a huge array that has to be read by different threads in parallel. Each thread has to read different entries at different places in the whole array, from start to finish. The buffer is read-only, so I don't think a "critical section" is required.
I'm afraid this approach will perform very poorly, but I don't see another way to do it. I could load the whole array into shared memory for each block, but I don't think there is enough shared memory for that.
Any ideas?
Edit: Some of you have asked why I have to access different parts of the array, so here is some explanation: I'm trying to implement the "auction algorithm". In one kernel, each thread (person) has to bid on an item, which has a price, depending on its interest in that item. Each thread has to look up its interest in a given object in a big array, but that is not a problem because I can coalesce those reads into shared memory. The problem is that once a thread has chosen an item to bid on, it has to check the item's price first, and since there are a great many objects to bid on, I can't bring all of the prices into shared memory. Moreover, each thread needs access to the whole buffer of prices, since it can bid on any object. My only advantage here is that the buffer is read-only.
Upvotes: 6
Views: 3142
Reputation: 6753
Reading from shared memory is much faster than reading from global memory. Maybe you can load into shared memory the subset of the array that the threads in the block require. If the threads in a block require values from vastly different parts of the array, you should change your algorithm, as that leads to non-coalesced access, which is slow.
Moreover, when reading from shared memory, be careful of bank conflicts, which occur when threads in the same warp access different addresses that fall in the same bank. Texture memory may also be a good choice because it is cached.
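As a rough sketch of the shared-memory idea, the kernel below stages the price buffer through shared memory one tile at a time, so that each block reads every price from global memory exactly once with coalesced loads. It assumes that every thread really does scan all prices to find its best item. The kernel name `bestItemKernel`, the tile size, the item-major layout of `benefit`, and the test data are all assumptions made for illustration and are not taken from the question.

```cuda
#include <cfloat>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // prices staged per iteration; must equal the block size

// Hypothetical kernel: each thread (person) scans all items and records the
// item with the highest net value (benefit - price).
// Assumed layout: benefit[item * numPersons + person], so that neighbouring
// threads read neighbouring addresses.
__global__ void bestItemKernel(const float* benefit, const float* price,
                               int* bestItem, int numPersons, int numItems)
{
    __shared__ float tile[TILE];
    int person = blockIdx.x * blockDim.x + threadIdx.x;
    float bestValue = -FLT_MAX;
    int best = -1;

    for (int base = 0; base < numItems; base += TILE) {
        // Cooperative, coalesced load of one tile of prices.
        int idx = base + threadIdx.x;
        if (idx < numItems) tile[threadIdx.x] = price[idx];
        __syncthreads();

        int remaining = numItems - base;
        int count = remaining < TILE ? remaining : TILE;
        if (person < numPersons) {
            for (int k = 0; k < count; ++k) {
                // All threads of a warp read the same tile[k]: a broadcast,
                // not a bank conflict.
                float v = benefit[(size_t)(base + k) * numPersons + person] - tile[k];
                if (v > bestValue) { bestValue = v; best = base + k; }
            }
        }
        __syncthreads();  // do not overwrite the tile while others still read it
    }
    if (person < numPersons) bestItem[person] = best;
}

int main()
{
    const int numPersons = 1000, numItems = 1000;  // made-up sizes
    std::vector<float> benefit((size_t)numItems * numPersons), price(numItems);
    for (int j = 0; j < numItems; ++j) {
        price[j] = (float)(j % 13);
        for (int p = 0; p < numPersons; ++p)
            benefit[(size_t)j * numPersons + p] = (float)((p * 31 + j * 17) % 101);
    }

    float *dBenefit, *dPrice;
    int *dBest;
    cudaMalloc(&dBenefit, benefit.size() * sizeof(float));
    cudaMalloc(&dPrice, price.size() * sizeof(float));
    cudaMalloc(&dBest, numPersons * sizeof(int));
    cudaMemcpy(dBenefit, benefit.data(), benefit.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dPrice, price.data(), price.size() * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (numPersons + TILE - 1) / TILE;
    bestItemKernel<<<blocks, TILE>>>(dBenefit, dPrice, dBest, numPersons, numItems);

    std::vector<int> best(numPersons);
    cudaMemcpy(best.data(), dBest, numPersons * sizeof(int), cudaMemcpyDeviceToHost);

    // CPU reference using the same scan order and tie-breaking.
    int mismatches = 0;
    for (int p = 0; p < numPersons; ++p) {
        float bv = -FLT_MAX;
        int b = -1;
        for (int j = 0; j < numItems; ++j) {
            float v = benefit[(size_t)j * numPersons + p] - price[j];
            if (v > bv) { bv = v; b = j; }
        }
        if (b != best[p]) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(dBenefit); cudaFree(dPrice); cudaFree(dBest);
    return mismatches != 0;
}
```

This only helps when the threads of a block consume the same prices at roughly the same time. If each thread needs just one price at an unpredictable index, tiling wastes bandwidth and a cached read path is likely the better fit.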
Upvotes: 1
Reputation: 666
The fastest way to access global memory is via coalesced access; however, in your case this may not be possible. You could investigate texture memory, which is read-only, though it is usually used for spatially local 2D access.
Section 3.2 of the CUDA Best Practices Guide has great information about this and other memory techniques.
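For a scattered, read-only lookup like the price check, one option that avoids setting up textures explicitly is the `__ldg()` intrinsic, which on devices of compute capability 3.5 and later loads through the read-only data cache. Note that this is a substitute for the texture API mentioned above, not the same thing. The sketch below is a minimal example under that assumption; the kernel name `gatherPrices`, the `chosen` array of per-thread item indices, and the test data are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread (person) has already chosen an item and
// fetches that item's price from a read-only buffer.
__global__ void gatherPrices(const float* __restrict__ price,
                             const int* __restrict__ chosen,
                             float* __restrict__ out, int numPersons)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p < numPersons) {
        int item = chosen[p];          // coalesced: thread p reads element p
        out[p] = __ldg(&price[item]);  // scattered read via the read-only cache
    }
}

int main()
{
    const int numPersons = 4096, numItems = 100000;  // made-up sizes
    std::vector<float> price(numItems);
    std::vector<int> chosen(numPersons);
    for (int j = 0; j < numItems; ++j) price[j] = (float)(j % 97);
    // Pseudo-random but deterministic choice of item per person.
    for (int p = 0; p < numPersons; ++p)
        chosen[p] = (int)(((long long)p * 7919) % numItems);

    float *dPrice, *dOut;
    int *dChosen;
    cudaMalloc(&dPrice, numItems * sizeof(float));
    cudaMalloc(&dChosen, numPersons * sizeof(int));
    cudaMalloc(&dOut, numPersons * sizeof(float));
    cudaMemcpy(dPrice, price.data(), numItems * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dChosen, chosen.data(), numPersons * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    int blocks = (numPersons + threads - 1) / threads;
    gatherPrices<<<blocks, threads>>>(dPrice, dChosen, dOut, numPersons);

    std::vector<float> out(numPersons);
    cudaMemcpy(out.data(), dOut, numPersons * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify against a direct host-side lookup.
    int mismatches = 0;
    for (int p = 0; p < numPersons; ++p)
        if (out[p] != price[chosen[p]]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dPrice); cudaFree(dChosen); cudaFree(dOut);
    return mismatches != 0;
}
```

Whether this is actually faster than a plain global load depends on the GPU generation and on how much the threads' indices overlap, so it is worth profiling both variants.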
Upvotes: 4