Reputation: 5186
I have a severe performance issue when running this code on an NVIDIA K5000:
while ( atomicMax(&iThreadSemaphore, 0) )
;
On a GTX 650 Ti or K2000 the device function, including the above code, is executed in ~2900 msec. On a K5000 the exact same device function is executed in ~5000 msec. When I remove the while loop, the K5000 executes the device function in ~900 msec, which is about 1/3 of the K2000 time and is fine.
Does anybody have an idea why the atomicMax() function slows the K5000 down that much?
I could definitely isolate the problem code: it is the while loop.
Thank you.
Upvotes: 1
Views: 110
Reputation: 72335
Basically your problem sounds like a straightforward example of scalability limitations in your code.
The K5000 has 8 multiprocessors, whereas the others have only 4 multiprocessors. Given that you say you are running 147 blocks (which is more than enough to completely fill any of these GPUs during execution), the K5000 will have about twice as many threads in flight at one time as the K2000 or GTX 650 Ti. From the extremely limited description of your code, that would mean that you have twice as many threads contending atomically for the same semaphore. The more contention you have, the slower the code will be. I would expect twice as much contention for the same atomic resource to be at least twice as slow.
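To make the contention argument concrete, here is a minimal sketch of a micro-benchmark, not the asker's actual kernel. It compares every thread issuing atomicMax() on one address against only one thread per block doing so. The kernel names, the timeKernel helper, the 256 threads per block and the iteration count are all assumptions made for illustration; only the 147 blocks and the atomicMax(&iThreadSemaphore, 0) call come from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the semaphore in the question; it stays at 0 here, so nothing spins forever.
__device__ int iThreadSemaphore = 0;

// Every thread performs the atomic on the same address (maximum contention).
__global__ void allThreadsContend(int iterations)
{
    for (int i = 0; i < iterations; ++i)
        atomicMax(&iThreadSemaphore, 0);
}

// Only thread 0 of each block performs the atomic; the others wait at the barrier.
__global__ void oneThreadPerBlock(int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        if (threadIdx.x == 0)
            atomicMax(&iThreadSemaphore, 0);
        __syncthreads();
    }
}

// Hypothetical helper: time a single kernel launch with CUDA events, in milliseconds.
template <typename Kernel>
float timeKernel(Kernel kernel, int blocks, int threads, int iterations)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel<<<blocks, threads>>>(iterations);   // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(iterations);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int blocks = 147;       // block count mentioned in the answer
    const int threads = 256;      // assumption: block size is not given in the thread
    const int iterations = 1000;  // assumption: arbitrary amount of work

    float all = timeKernel(allThreadsContend, blocks, threads, iterations);
    float one = timeKernel(oneThreadPerBlock, blocks, threads, iterations);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("all threads contending: %.3f ms\n", all);
    printf("one thread per block:   %.3f ms\n", one);
    return 0;
}
```

It should build with something like nvcc contention.cu -o contention (the file name is arbitrary). The absolute numbers will depend on the GPU and driver, so the interesting part is how the gap between the two timings changes from one card to another.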
In summary, it would appear that there is nothing wrong with your K5000, except that it is large enough to expose serious scalability problems with your code.
Upvotes: 1