Why use blocks\grid instead of for-loop?

Question

Why use:

kernel<<<512, 512>>>( ); // somewhere in host code
// A function launched with <<<...>>> must be declared __global__, not __device__.
__global__ void kernel( ) {
    Code( );
}

Rather than:

kernel<<<1, 512>>>( 512 ); // somewhere in host code
// A function launched with <<<...>>> must be declared __global__, not __device__.
__global__ void kernel( int n ) {
    for ( int i = 0 ; i < n ; ++i ) {
        Code( );
    }
}

NOTE: I don't have a CUDA GPU yet, so I can't test this.

Is the first somehow faster? Can GPU cores not handle long-running threads, or do they lose speed the longer a thread runs?

I guess the second (for-loop) version is better when the desired number of iterations is not a multiple of the number of threads. (We could change the n variable in the last thread or core.)

Kerrek SB · Accepted Answer

The very idea of CUDA is that you should do parallel work in parallel. The entire execution architecture is designed to make that fast. Anything that is truly parallel, i.e. where all parallel pieces execute the exact same logic in lockstep, is better done by executing the same instructions on many, many cores at once than by executing many instructions with complicated branching and looping on one core.
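To make this concrete, here is a minimal sketch of the grid-based version, assuming the work is "add 1 to each element of an array". The names `addOne` and `runAddOne` are hypothetical stand-ins for the question's `Code( )`, not anything from the original post. It also shows the usual way to handle an element count that is not a multiple of the block size: round the block count up and let the surplus threads in the last block do nothing.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for Code(): each thread handles exactly one element.
__global__ void addOne(int *data, int n) {
    // Global index of this thread across the whole grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: the last block may contain threads with no element to process.
    if (i < n) {
        data[i] += 1;
    }
}

// Runs addOne on n zero-initialized elements (n > 0 assumed).
// Returns true if every element equals 1 afterwards.
bool runAddOne(int n) {
    std::vector<int> host(n, 0);
    const std::size_t bytes = static_cast<std::size_t>(n) * sizeof(int);

    int *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    if (cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return false;
    }

    // Round up so that blocks * threadsPerBlock >= n.
    const int threadsPerBlock = 512;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addOne<<<blocks, threadsPerBlock>>>(dev, n);

    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int v : host) {
        if (v != 1) return false;
    }
    return true;
}
```

Calling `runAddOne(1000)` from `main` launches two blocks of 512 threads; the 24 surplus threads fail the `i < n` test and simply return. So the "not aligned" case does not require a loop or a special value of `n` for the last thread.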

I recommend you read the extensive documentation Nvidia publishes about CUDA, paying special attention to warps, bank conflicts, local memory, branching, etc. Programming for a GPU is not at all trivial, and as with any kind of concurrent programming, you should expect the process to be painful and expensive unless you both have an extremely good reason to be concurrent and understand many of the low-level details very well.
