Reputation: 706
I have the following kernel, where each thread (1D grid, 1D blocks) processes just one element of the input array:
__global__ void normalize_fft_result(double *u_device, int n0)
{
    // Use 1D data mapping: one thread per element
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n0)
    {
        // Normalize the FFT result (divide by n0 as a double)
        u_device[tid] = u_device[tid] / (double)n0;
    }
}
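For reference, a minimal launch for this kernel might look like the sketch below; the block size of 256 is just an assumed example, not something taken from my actual code.

// Hypothetical launch configuration: one thread per element,
// rounding the grid size up so every element is covered.
int threads_per_block = 256;  // assumed block size
int blocks = (n0 + threads_per_block - 1) / threads_per_block;
normalize_fft_result<<<blocks, threads_per_block>>>(u_device, n0);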
I'm running this on a Fermi GPU, where I found out that the cache line the processor loads into L1 is 128 bytes long. I'm working with doubles, which are 8 bytes each, so a single 128 B transaction only provides operands for half of the threads in a warp (128/8 = 16). This means that, to get the data for the other 16 threads, the warp needs a second 128 B transaction.
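To make the arithmetic concrete: a warp of 32 threads reading consecutive doubles requests 32 x 8 B = 256 B, i.e. two 128 B cache lines. By contrast, a hypothetical single-precision variant of the same kernel (shown purely for comparison, not something I intend to switch to) would request 32 x 4 B = 128 B per warp, which fits in one transaction:

// Hypothetical float version, for comparison only: 32 threads x 4 B = 128 B,
// so a fully coalesced warp load fits in a single 128 B cache-line transaction.
__global__ void normalize_fft_result_float(float *u_device, int n0)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n0)
    {
        u_device[tid] = u_device[tid] / (float)n0;
    }
}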
Threads within a warp are supposed to execute concurrently, so what exactly happens while waiting for the second transaction? Do the first 16 threads wait for the last 16, or do they execute the instruction while the others wait for their operands?
In any case, doesn't this wait for data produce unavoidable latency?
Upvotes: 2
Views: 208
Reputation: 11519
The warp scheduler will replay the instruction until all threads have completed the memory load or store. On CC 2.x devices, a 64-bit load is issued for the first 16 threads and then for the second 16. Additional replays are performed if there is further address divergence (e.g. each thread reads a separate cache line) and for each cache miss. On CC 2.x devices, additional independent instructions from the same warp can issue once all threads of the load or store instruction have completed.
Additional information regarding replays for global, local, and shared memory can be found in the CUDA C Programming Guide section on Compute Capability 2.x.
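As a rough illustration of the address-divergence case (this is example code, not code from the question): with a stride of 16 doubles (128 B), every thread in a warp reads from a different cache line, so a single warp load can require up to 32 separate 128 B transactions and the corresponding replays on CC 2.x.

// Hypothetical strided-read kernel showing address divergence within a warp.
// With stride = 16 (doubles), each thread touches a different 128 B cache line.
__global__ void strided_read(const double *in, double *out, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * stride < n)
    {
        out[tid] = in[tid * stride];
    }
}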
Upvotes: 1