Reputation: 91
I have written a string matching program to compare GPU performance against the CPU. When I launch the kernel with <<<1,1>>> (one block containing one thread), the execution time is 430 ms. When I launch it with <<<1,2>>> (one block containing two threads), the execution time is 303 ms. Finally, when I launch it with <<<2,1>>> (two blocks of one thread each), the time is exactly half of 430 ms, that is, 215 ms.
What is the difference between a thread in a block and a warp? Why is one block containing two threads slower than two blocks with one thread each?
Upvotes: 6
Views: 4436
Reputation: 72348
The first point to make is that the GPU requires hundreds or thousands of active threads to hide the architecture's inherent high latency and fully utilise the available arithmetic capacity and memory bandwidth. Benchmarking code with one or two threads in one or two blocks is a complete waste of time.
The second point is that there is no such thing as "thread in a block". Threads are fundamentally executed in warps of 32 threads. A block is composed of 1 or more warps, and a grid of 1 or more blocks.
When you launch a grid containing a single block with one thread, you launch 1 warp. This warp contains 31 "dummy" threads which are masked off, and a single live thread. If you launch a single block with two threads, you still launch 1 warp, but now the single warp contains 2 active threads.
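To make the counting above concrete, here is a minimal host-side sketch that works out how many warps a launch configuration produces and how many lanes in each block's last warp are masked off. It assumes a warp size of 32 (device code can read the built-in warpSize instead), and the names WarpUsage and warp_usage are hypothetical helpers for illustration, not part of the CUDA API.

```cuda
// Assumption: warp size is 32, as stated in the answer above.
const int kWarpSize = 32;

// Hypothetical summary of how a <<<blocks, threads>>> launch maps onto warps.
struct WarpUsage {
    int warps_per_block;  // warps allocated for each block
    int total_warps;      // warps across the whole grid
    int idle_lanes;       // masked-off lanes in the last warp of each block
};

// Hypothetical helper: a block is always rounded up to whole warps.
inline WarpUsage warp_usage(int blocks, int threads_per_block) {
    WarpUsage u;
    u.warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    u.total_warps = blocks * u.warps_per_block;
    u.idle_lanes = u.warps_per_block * kWarpSize - threads_per_block;
    return u;
}
```

With this helper, warp_usage(1, 1) and warp_usage(1, 2) both report a single warp (with 31 and 30 idle lanes respectively), while warp_usage(2, 1) reports two warps.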
When you launch two blocks containing a single thread each, it results in two warps, each of which contains 1 active thread. Because all scheduling and execution is done on a per-warp basis, you now have two separate entities (warps) which the hardware can schedule and execute independently. This allows more latency hiding and fewer instruction pipeline stalls, and the code runs faster as a result.
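If you want to compare launch configurations yourself, something along these lines could be used. This is only a sketch: the string matching kernel from the question is not shown, so count_byte is a hypothetical stand-in workload (counting one byte value with a grid-stride loop), time_launch is a hypothetical helper, and the buffer size and the 64 x 256 configuration are arbitrary choices. Timing uses CUDA events; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in workload: count occurrences of `target` in `text`.
// The grid-stride loop splits the work across however many threads are launched.
__global__ void count_byte(const unsigned char* text, int n,
                           unsigned char target, unsigned int* count) {
    int stride = gridDim.x * blockDim.x;
    unsigned int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (text[i] == target) ++local;
    }
    atomicAdd(count, local);  // one atomic per thread
}

// Hypothetical helper: time one launch of the kernel in milliseconds.
static float time_launch(int blocks, int threads, const unsigned char* d_text,
                         int n, unsigned int* d_count) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMemset(d_count, 0, sizeof(unsigned int));

    cudaEventRecord(start);
    count_byte<<<blocks, threads>>>(d_text, n, 'a', d_count);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;  // arbitrary buffer size for the sketch
    unsigned char* d_text = nullptr;
    unsigned int* d_count = nullptr;
    cudaMalloc(&d_text, n);
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_text, 'a', n);  // fill the buffer with a known byte

    // The three configurations from the question, plus a larger one.
    const int configs[][2] = {{1, 1}, {1, 2}, {2, 1}, {64, 256}};
    for (const auto& c : configs) {
        float ms = time_launch(c[0], c[1], d_text, n, d_count);
        printf("<<<%d,%d>>>: %.3f ms\n", c[0], c[1], ms);
    }

    cudaFree(d_text);
    cudaFree(d_count);
    return 0;
}
```

The absolute numbers will depend on the GPU and the workload, so treat any result as indicative only. The first launch also tends to include one-off startup overhead, so it is worth running each configuration more than once before comparing.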
So the TL;DR answer is that, for these launches, 1 block = 1 warp and 2 blocks = 2 warps, the latter being less sub-optimal than the former.
Upvotes: 20