Reputation: 2519
I'm wondering whether the L2 cache is freed between multiple kernel invocations. For example, I have a kernel that does some preprocessing on data and a second one that uses it. Is it possible to achieve greater performance if the data size is less than 768 KB? I see no reason for the NVIDIA guys to implement it otherwise, but maybe I'm wrong. Does anybody have experience with that?
Upvotes: 4
Views: 1271
Reputation: 2053
Assuming you are talking about the L2 data cache in Fermi:
I think the caches are flushed after each kernel invocation. In my experience, running two consecutive launches of the same kernel with a lot of memory accesses (and a high number of L2 cache misses) doesn't produce any substantial change in the L1/L2 cache statistics.
In your problem, I think, depending on the data dependency, it may be possible to put the two stages into one kernel (with some synchronization) so the second part of the kernel can reuse the data produced by the first part; see the sketch below.
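A minimal sketch of that fused-kernel idea, assuming the dependency between the two stages stays within a block (the kernel name, array names, block size, and the placeholder arithmetic are made up for illustration):

```cuda
// Fuse preprocessing and the main computation into one kernel so the
// intermediate data stays on-chip (shared memory) instead of going through
// global memory / L2 between two separate launches.
__global__ void fusedKernel(const float *in, float *out, int n)
{
    __shared__ float tile[256];               // one element per thread (blockDim.x == 256 assumed)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage 1: preprocessing, result kept in shared memory.
    if (idx < n)
        tile[threadIdx.x] = in[idx] * 2.0f;   // placeholder preprocessing

    __syncthreads();                          // make stage-1 results visible to the whole block

    // Stage 2: consume the preprocessed data without a second kernel launch.
    if (idx < n)
        out[idx] = tile[threadIdx.x] + 1.0f;  // placeholder second stage
}
```

This only works when each block's second stage needs nothing beyond what that same block produced in the first stage; otherwise you still need a grid-wide synchronization point.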
Here is another trick: if you know the GPU has, for example, N SMs, you can perform the first part using the first N * M1 blocks and the second part using the next N * M2 blocks. Make sure all the blocks in the first part finish at (almost) the same time using some synchronization. In my experience, the block scheduling order is really quite deterministic; a hedged sketch follows.
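A sketch of that block-partitioning trick, under the assumptions the answer states. The kernel name, `stage1Blocks` parameter, global counter, and placeholder arithmetic are invented for illustration; CUDA does not guarantee block scheduling order, so a stage-2 block that gets scheduled before enough stage-1 blocks can run will spin forever:

```cuda
// Counter of stage-1 blocks that have finished writing their results.
__device__ unsigned int blocksDone = 0;

__global__ void twoStageKernel(float *data, int n, int stage1Blocks)
{
    if (blockIdx.x < stage1Blocks) {
        // ---- Stage 1: preprocessing ----
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            data[idx] *= 2.0f;                // placeholder preprocessing

        __threadfence();                      // make the writes visible grid-wide
        __syncthreads();
        if (threadIdx.x == 0)
            atomicAdd(&blocksDone, 1);        // signal that this block is done
    } else {
        // ---- Stage 2: wait until all stage-1 blocks have signalled ----
        // WARNING: relies on stage-1 blocks being scheduled first; this is
        // observed behaviour, not something the programming model guarantees.
        if (threadIdx.x == 0)
            while (atomicAdd(&blocksDone, 0) < (unsigned int)stage1Blocks)
                ;                             // spin until stage 1 is finished
        __syncthreads();

        int idx = (blockIdx.x - stage1Blocks) * blockDim.x + threadIdx.x;
        if (idx < n)
            data[idx] += 1.0f;                // placeholder second stage
    }
}
```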
Hope it helps.
Upvotes: 2