Eelco Hoogendoorn

Reputation: 10769

CUDA thread local array

I am writing a CUDA kernel that requires maintaining a small associative array per thread. By small, I mean 8 elements max in the worst case, and an expected number of entries of two or so; so nothing fancy: just an array of keys and an array of values, with indexing and insertion happening by means of a loop over said arrays.

Now I do this by means of thread-local memory; that is, identifier[size]; where size is a compile-time constant. Now I've heard that under some circumstances this memory is stored off-chip, and under some circumstances it is stored on-chip. Obviously I want the latter, under all circumstances. I understand that I could accomplish this with a block of shared memory, where I let each thread work on its own private section; but really? I don't want to share anything between threads, and it would be a horrible kludge.

What exactly are the rules for where this memory goes? I can't seem to find any word from NVIDIA on this. For the record, I am using CUDA 5 and targeting Kepler.
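For concreteness, here is a minimal host-side sketch of the loop-based structure described above (the names SmallMap, KMAX, and the int key/value types are illustrative assumptions, not from the question); the device version would be the same logic inside a kernel:

```c
#include <assert.h>

#define KMAX 8      /* worst-case capacity stated in the question */
#define EMPTY (-1)  /* sentinel key marking an unused slot */

struct SmallMap {
    int keys[KMAX];
    int values[KMAX];
};

static void map_init(struct SmallMap* m) {
    for (int i = 0; i < KMAX; ++i)
        m->keys[i] = EMPTY;
}

/* Insert-or-update by linear scan; returns 0 on success, -1 when full. */
static int map_put(struct SmallMap* m, int key, int value) {
    for (int i = 0; i < KMAX; ++i) {
        if (m->keys[i] == key || m->keys[i] == EMPTY) {
            m->keys[i] = key;
            m->values[i] = value;
            return 0;
        }
    }
    return -1;
}

/* Lookup by linear scan; returns fallback when the key is absent. */
static int map_get(const struct SmallMap* m, int key, int fallback) {
    for (int i = 0; i < KMAX; ++i)
        if (m->keys[i] == key)
            return m->values[i];
    return fallback;
}
```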

Upvotes: 1

Views: 3143

Answers (1)

tera

Reputation: 7265

Local variables are stored either in registers or in off-chip local memory (which is cached on compute capability >= 2.0).

Registers are used for arrays only if all array indices are constants that can be determined at compile time, as the architecture has no means of indexed access to registers at runtime.

In your case the number of keys may be small enough to use registers (and to tolerate the increased register pressure). Unroll all loops over the array accesses to allow the compiler to place the keys in registers, and use cuobjdump -sass to check that it actually does.
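A device-side sketch of that advice (the kernel name and its insert-or-increment use of the map are assumptions for illustration, not from the question): the compile-time bound NKEYS plus #pragma unroll gives the compiler a constant index for every access, which is the precondition for register placement.

```cuda
#define NKEYS 8
#define EMPTY (-1)

__global__ void histogram8(const int* input, int n,
                           int* out_keys, int* out_vals)
{
    int keys[NKEYS];
    int vals[NKEYS];

    #pragma unroll                     // compile-time trip count:
    for (int i = 0; i < NKEYS; ++i) {  // every index becomes a constant
        keys[i] = EMPTY;
        vals[i] = 0;
    }

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int j = tid; j < n; j += stride) {
        int k = input[j];
        #pragma unroll                 // unrolled insert-or-increment scan
        for (int i = 0; i < NKEYS; ++i) {
            if (keys[i] == k || keys[i] == EMPTY) {
                keys[i] = k;
                vals[i] += 1;
                break;
            }
        }
    }

    // write the private map to a per-thread slice of global memory
    #pragma unroll
    for (int i = 0; i < NKEYS; ++i) {
        out_keys[tid * NKEYS + i] = keys[i];
        out_vals[tid * NKEYS + i] = vals[i];
    }
}
```

When checking the SASS with cuobjdump -sass, LDL/STL (local load/store) instructions in the kernel body indicate the arrays were spilled to local memory rather than held in registers.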

If you don't want to spend registers, you can either use shared memory with a per-thread offset, as you mentioned (but check that the additional registers needed to hold the per-thread indices into shared memory don't outweigh the number of keys you use), or do nothing and use off-chip "local" memory (really "global" memory with just a different addressing scheme), hoping for the cache to do its work.
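If you do take the shared-memory route, the usual trick is to stride each thread's slots by blockDim.x, so that slot i of neighbouring threads lands in consecutive banks; a minimal sketch (the SLOT macro and kernel shape are hypothetical):

```cuda
#define NKEYS 8

__global__ void kernel_with_shared_map()
{
    // launch with 2 * NKEYS * blockDim.x * sizeof(int) dynamic shared memory
    extern __shared__ int smem[];
    int* keys = smem;                       // NKEYS * blockDim.x entries
    int* vals = smem + NKEYS * blockDim.x;  // NKEYS * blockDim.x entries

    // each thread owns one "column"; the blockDim.x stride avoids
    // bank conflicts between neighbouring threads
    #define SLOT(arr, i) arr[(i) * blockDim.x + threadIdx.x]

    for (int i = 0; i < NKEYS; ++i)
        SLOT(keys, i) = -1;                 // empty sentinel
    // ... lookup/insert loops index SLOT(keys, i) / SLOT(vals, i) ...
}
```

No __syncthreads() is needed for these accesses, since each thread only ever touches its own column.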

If you rely on the cache to hold the keys and values, and you don't use much shared memory, it may be beneficial to select the 48 kB cache / 16 kB shared memory setting over the default 16 kB cache / 48 kB shared memory split, using cudaDeviceSetCacheConfig().
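For completeness, a minimal host-side fragment for that setting (error handling elided):

```cuda
// Prefer a 48 kB L1 / 16 kB shared split for kernels on this device.
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
// The same preference can be requested per-kernel via cudaFuncSetCacheConfig().
```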

Upvotes: 4
