interestedparty333

Reputation: 2536

How do you keep data in fast GPU memory (L1/shared) across kernel invocations?

Let's suppose I need to answer 1 million queries, each of which has about 1.5MB of data that's reusable across invocations and about 8KB of data that's unique to each query.

One approach is to launch a kernel for each query, copying the 1.5MB + 8KB of data to shared memory each time. However, I then spend a lot of time just copying 1.5MB of data that really could persist across queries.
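A minimal sketch of that first approach, with illustrative names and sizes (answerQuery, kReusableBytes, kQueryBytes are placeholders, not actual code); the wasteful part is that the identical 1.5MB is re-sent on every iteration:

```
// Sketch only: hypothetical names and sizes.
#include <cuda_runtime.h>
#include <vector>

constexpr size_t kReusableBytes = 1536 * 1024;  // ~1.5MB reused across queries
constexpr size_t kQueryBytes    = 8 * 1024;     // ~8KB unique per query

// Placeholder kernel; the real one would stage its inputs into shared memory and compute.
__global__ void answerQuery(const char* reusable, const char* query, int* result) {
    if (threadIdx.x == 0 && blockIdx.x == 0) *result = reusable[0] + query[0];
}

int main() {
    std::vector<char> hReusable(kReusableBytes), hQuery(kQueryBytes);
    char *dReusable, *dQuery; int *dResult;
    cudaMalloc(&dReusable, kReusableBytes);
    cudaMalloc(&dQuery, kQueryBytes);
    cudaMalloc(&dResult, sizeof(int));

    for (int q = 0; q < 1000000; ++q) {
        // The wasteful part: the 1.5MB is identical every time, yet re-sent per query.
        cudaMemcpy(dReusable, hReusable.data(), kReusableBytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dQuery, hQuery.data(), kQueryBytes, cudaMemcpyHostToDevice);
        answerQuery<<<1, 256>>>(dReusable, dQuery, dResult);
    }
    cudaDeviceSynchronize();
    return 0;
}
```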

Another approach is to "recycle" the GPU threads (see https://stackoverflow.com/a/49957384/3738356). That involves launching one kernel that immediately copies the 1.5MB of data to shared memory. And then the kernel waits for requests to come in, waiting for the 8KB of data to show up before proceeding with each iteration. It really seems like CUDA wasn't meant to be used this way. If one just uses managed memory, and volatile+monotonically increasing counters to synchronize, there's still no guarantee that the data necessary to compute the answer will be on the GPU when you go to read it. You can seed the values in the memory with dummy values like -42 that indicate that the value hasn't yet made its way to the GPU (via the caching/managed memory mechanisms), and then busy wait until the values become valid. Theoretically, that should work. However, I had enough memory errors that I've given up on it for now, and I've pursued....

Another approach still uses recycled threads but instead synchronizes data via cudaMemcpyAsync, CUDA streams, CUDA events, and still a couple of volatile + monotonically increasing counters. I hear I need to pin the 8KB of data that's fresh with each query in order for the cudaMemcpyAsync to work correctly; without pinning, the async copy isn't blocked, its effects just aren't observable. I suspect that with enough grit I can make this work too.
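A small sketch of the host side of this approach, with placeholder names (queryStream, queryCopied, bumpReadyCounter). The 8KB buffer is page-locked with cudaMallocHost so the cudaMemcpyAsync is a true asynchronous DMA, and an event orders the copy before signalling the persistent kernel; the long-running kernel itself would have to live in its own non-blocking stream so these copies don't queue behind it:

```
// Sketch only: queryStream, queryCopied, and bumpReadyCounter are placeholders.
#include <cuda_runtime.h>

constexpr size_t kQueryBytes = 8 * 1024;

int main() {
    char* hQuery;                          // page-locked (pinned) host staging buffer
    cudaMallocHost(&hQuery, kQueryBytes);  // pinning is what makes the async copy truly async

    char* dQuery;
    cudaMalloc(&dQuery, kQueryBytes);

    cudaStream_t queryStream;
    cudaStreamCreate(&queryStream);
    cudaEvent_t queryCopied;
    cudaEventCreate(&queryCopied);

    for (int q = 0; q < 4; ++q) {
        // ... fill hQuery with query q's payload ...

        cudaMemcpyAsync(dQuery, hQuery, kQueryBytes, cudaMemcpyHostToDevice, queryStream);
        cudaEventRecord(queryCopied, queryStream);

        // Only signal the persistent kernel once the copy has actually landed.
        cudaEventSynchronize(queryCopied);
        // bumpReadyCounter(q + 1);   // hypothetical: bump the ready counter from the sketch above
    }

    cudaEventDestroy(queryCopied);
    cudaStreamDestroy(queryStream);
    cudaFree(dQuery);
    cudaFreeHost(hQuery);
    return 0;
}
```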

However, all of the above makes me think "I'm doing it wrong." How do you keep extremely reusable data in the GPU caches so it can be accessed from one query to the next?

Upvotes: 2

Views: 771

Answers (1)

Barış Tanyeri

Reputation: 130

  • First of all, to observe the effects of streams and async copying you definitely need to pin the host memory. Then you can see concurrent kernel invocations "almost" happening at the same time. I'd rather use async copying, since it makes me feel in control of the situation.
  • Secondly, you could just hold on to the data in global memory and load it into shared memory whenever you need it. To my knowledge, shared memory is only visible to the kernel itself and is discarded when the kernel terminates. Try issuing async copies while the kernel is running and synchronize the streams accordingly. Don't forget to __syncthreads() after loading into shared memory; a sketch of this pattern follows below. I hope it helps.
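To make the second point concrete, a minimal sketch with illustrative names (answerQuery, kTileBytes): the 1.5MB is uploaded to device global memory once and stays resident across launches, and each per-query kernel stages only its own tile plus the 8KB payload into shared memory before computing:

```
// Sketch only: kTileBytes and answerQuery are illustrative names.
#include <cuda_runtime.h>

constexpr size_t kReusableBytes = 1536 * 1024;  // stays resident in device global memory
constexpr size_t kQueryBytes    = 8 * 1024;
constexpr size_t kTileBytes     = 16 * 1024;    // slice of the reusable data handled per block

__global__ void answerQuery(const char* reusable, const char* query, int* result) {
    __shared__ char sTile[kTileBytes];
    __shared__ char sQuery[kQueryBytes];

    // Each block loads its own tile of the persistent data plus the 8KB query payload.
    const char* tile = reusable + (size_t)blockIdx.x * kTileBytes;
    for (size_t i = threadIdx.x; i < kTileBytes; i += blockDim.x)  sTile[i]  = tile[i];
    for (size_t i = threadIdx.x; i < kQueryBytes; i += blockDim.x) sQuery[i] = query[i];
    __syncthreads();  // every thread sees the shared-memory copies before computing

    // ... real computation goes here ...
    if (threadIdx.x == 0) atomicAdd(result, sTile[0] + sQuery[0]);
}

int main() {
    char *dReusable, *dQuery; int* dResult;
    cudaMalloc(&dReusable, kReusableBytes);
    cudaMalloc(&dQuery, kQueryBytes);
    cudaMalloc(&dResult, sizeof(int));

    // Upload the reusable 1.5MB exactly once; it survives across kernel launches.
    // cudaMemcpy(dReusable, hostReusable, kReusableBytes, cudaMemcpyHostToDevice);

    const unsigned blocks = (unsigned)(kReusableBytes / kTileBytes);
    for (int q = 0; q < 1000000; ++q) {
        // Per query, only the 8KB payload crosses the bus.
        // cudaMemcpy(dQuery, hostQuery[q], kQueryBytes, cudaMemcpyHostToDevice);
        answerQuery<<<blocks, 256>>>(dReusable, dQuery, dResult);
    }
    cudaDeviceSynchronize();
    return 0;
}
```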

Upvotes: 1
