Reputation: 163
Upvotes: 0
Views: 1789
Reputation: 15724
If you plan on accessing each value only once, then you won't gain anything from using shared memory.
Values in shared memory are only visible within a block, so one or more threads in each block still have to load the values from global memory. You can't avoid the global memory accesses either way.
If you have a device of compute capability >= 2.0 (Fermi), values read from global memory are automatically cached in the L1 and L2 caches. L1 has the same latency as shared memory.
Latency is a fixed property of the memory you're accessing; it doesn't change with your access pattern. Latency is always much lower for shared memory than for global memory.
I think what you might really be asking is what type of access would give you the best memory throughput. If you will be using each value only once, case (2) will give the best throughput. If you will be reusing values and have CC >= 2.0, letting L1 handle the caching is likely to give the best throughput. If you're reusing values on CC < 2.0, using shared memory will give the best throughput.
Case (1) may or may not cause bank conflicts, but for values that are already stored in shared memory it will give better throughput either way.
Case (2) describes the optimal access pattern for global memory.
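For concreteness, here is a minimal sketch of what case (2) looks like in a kernel: each thread reads its own element straight from global memory, so consecutive threads in a warp touch consecutive addresses and the loads coalesce. The kernel name and the scaling operation are only illustrative.

```cuda
// Case (2) sketch: direct, coalesced global-memory access.
// Thread i reads element i, so consecutive threads in a warp hit
// consecutive addresses and the hardware can service each warp's
// loads with the minimum number of transactions. On CC >= 2.0,
// any reuse would be handled transparently by the L1/L2 caches.
__global__ void scaleDirect(const float *in, float *out, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];   // one coalesced read, one coalesced write
}
```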
Upvotes: 4
Reputation: 6003
Perhaps I don't understand the difference between the two cases, but if I do:
The second is faster if your hardware architecture allows it, for example on a multicore machine with parallel registers. Notice also that in the second case, even from a pure software viewpoint, the data does not need to be made thread-safe against concerns such as race conditions caused by interleaving.
Think of it like this:
CASE 2:
You have a large table with five dinners and five kids to eat them: no synchronization is needed.
CASE 1:
You have, say, three tables with three dinners, so two kids may have to eat from the same plate and thus need to synchronize their movements so they don't hit each other. Synchronization means delay.
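To tie the analogy back to CUDA, here is a hedged sketch of case (1): the block first stages values into shared memory, and the __syncthreads() barrier is the "synchronization" (and delay) described above, since no thread may read a neighbor's value until every thread has finished loading. The kernel name and tile size are made up; assume the block is launched with TILE threads.

```cuda
// Case (1) sketch: stage a tile in shared memory, then let threads
// reuse values loaded by their neighbors. The __syncthreads() barrier
// is the synchronization cost the analogy refers to.
#define TILE 256

__global__ void sumWithLeftNeighbor(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // each thread brings one "plate"
    __syncthreads();                             // wait until the whole table is set

    // Each thread now reads a value that another thread loaded.
    int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
    if (i < n)
        out[i] = tile[threadIdx.x] + tile[left];
}
```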
Upvotes: 0