charis

Reputation: 439

How to avoid bank conflicts when loading data from global to shared memory

The problem involves strided accesses to an unsigned char array stored in global memory on a compute capability 1.3 GPU. To satisfy the coalescing requirements of global memory, the threads access global memory sequentially and copy the array to shared memory, using only two memory transactions for the following example:

#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void kernel ( unsigned char *d_text, unsigned char *d_out ) {

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    extern __shared__ unsigned char s_array[];

    uint4 *uint4_text = ( uint4 * ) d_text;
    uint4 var;

    //single 16-byte load; every thread reads the same uint4, so each
    //half-warp needs only one global memory transaction
    var = uint4_text[0];

    uchar4 c0 = *reinterpret_cast<uchar4 *>(&var.x);
    uchar4 c4 = *reinterpret_cast<uchar4 *>(&var.y);
    uchar4 c8 = *reinterpret_cast<uchar4 *>(&var.z);
    uchar4 c12 = *reinterpret_cast<uchar4 *>(&var.w);

    s_array[threadIdx.x*16 + 0] = c0.x;
    s_array[threadIdx.x*16 + 1] = c0.y;
    s_array[threadIdx.x*16 + 2] = c0.z;
    s_array[threadIdx.x*16 + 3] = c0.w;

    s_array[threadIdx.x*16 + 4] = c4.x;
    s_array[threadIdx.x*16 + 5] = c4.y;
    s_array[threadIdx.x*16 + 6] = c4.z;
    s_array[threadIdx.x*16 + 7] = c4.w;

    s_array[threadIdx.x*16 + 8] = c8.x;
    s_array[threadIdx.x*16 + 9] = c8.y;
    s_array[threadIdx.x*16 + 10] = c8.z;
    s_array[threadIdx.x*16 + 11] = c8.w;

    s_array[threadIdx.x*16 + 12] = c12.x;
    s_array[threadIdx.x*16 + 13] = c12.y;
    s_array[threadIdx.x*16 + 14] = c12.z;
    s_array[threadIdx.x*16 + 15] = c12.w;

    d_out[idx] = s_array[threadIdx.x*16];
}

int main ( void ) {

    unsigned char *d_text, *d_out;

    unsigned char *h_out = ( unsigned char * ) malloc ( 32 * sizeof ( unsigned char ) );
    unsigned char *h_text = ( unsigned char * ) malloc ( 32 * sizeof ( unsigned char ) );

    int i;

    for ( i = 0; i < 32; i++ )
        h_text[i] = 65 + i;

    cudaMalloc ( ( void** ) &d_text, 32 * sizeof ( unsigned char ) );
    cudaMalloc ( ( void** ) &d_out, 32 * sizeof ( unsigned char ) );

    cudaMemcpy ( d_text, h_text, 32 * sizeof ( unsigned char ), cudaMemcpyHostToDevice );

    kernel<<<1,32,16128>>>(d_text, d_out );

    cudaMemcpy ( h_out, d_out, 32 * sizeof ( unsigned char ), cudaMemcpyDeviceToHost );

    for ( i = 0; i < 32; i++ )
        printf("%c\n", h_out[i]);

    free ( h_text );
    free ( h_out );
    cudaFree ( d_text );
    cudaFree ( d_out );

    return 0;
}

The problem is that bank conflicts occur when the data are copied to shared memory (384 conflicts for the above example, as reported by nvprof), which leads to serialized accesses by the threads.

Shared memory is divided into 16 32-bit banks (32 on newer device architectures) so that the 16 threads of the same half-warp can be served simultaneously. The data are interleaved across the banks, with the i-th 32-bit word always stored in shared memory bank i % 16.

Since each thread reads 16 bytes in one memory transaction, the characters are stored to shared memory in a strided fashion. This results in conflicts between threads 0, 4, 8, 12; 1, 5, 9, 13; 2, 6, 10, 14; and 3, 7, 11, 15 of the same half-warp. A naive way to eliminate the bank conflicts would be to use if/else branching to store the data to shared memory in round-robin order, similar to the following, but this causes serious thread divergence:

int tid16 = threadIdx.x % 16;

if ( tid16 < 4 ) {

    s_array[threadIdx.x * 16 + 0] = c0.x;
    s_array[threadIdx.x * 16 + 1] = c0.y;
    s_array[threadIdx.x * 16 + 2] = c0.z;
    s_array[threadIdx.x * 16 + 3] = c0.w;

    s_array[threadIdx.x * 16 + 4] = c4.x;
    s_array[threadIdx.x * 16 + 5] = c4.y;
    s_array[threadIdx.x * 16 + 6] = c4.z;
    s_array[threadIdx.x * 16 + 7] = c4.w;

    s_array[threadIdx.x * 16 + 8] = c8.x;
    s_array[threadIdx.x * 16 + 9] = c8.y;
    s_array[threadIdx.x * 16 + 10] = c8.z;
    s_array[threadIdx.x * 16 + 11] = c8.w;

    s_array[threadIdx.x * 16 + 12] = c12.x;
    s_array[threadIdx.x * 16 + 13] = c12.y;
    s_array[threadIdx.x * 16 + 14] = c12.z;
    s_array[threadIdx.x * 16 + 15] = c12.w;

} else if ( tid16 < 8 ) {

    s_array[threadIdx.x * 16 + 4] = c4.x;
    s_array[threadIdx.x * 16 + 5] = c4.y;
    s_array[threadIdx.x * 16 + 6] = c4.z;
    s_array[threadIdx.x * 16 + 7] = c4.w;

    s_array[threadIdx.x * 16 + 8] = c8.x;
    s_array[threadIdx.x * 16 + 9] = c8.y;
    s_array[threadIdx.x * 16 + 10] = c8.z;
    s_array[threadIdx.x * 16 + 11] = c8.w;

    s_array[threadIdx.x * 16 + 12] = c12.x;
    s_array[threadIdx.x * 16 + 13] = c12.y;
    s_array[threadIdx.x * 16 + 14] = c12.z;
    s_array[threadIdx.x * 16 + 15] = c12.w;

    s_array[threadIdx.x * 16 + 0] = c0.x;
    s_array[threadIdx.x * 16 + 1] = c0.y;
    s_array[threadIdx.x * 16 + 2] = c0.z;
    s_array[threadIdx.x * 16 + 3] = c0.w;

} else if ( tid16 < 12 ) {

    s_array[threadIdx.x * 16 + 8] = c8.x;
    s_array[threadIdx.x * 16 + 9] = c8.y;
    s_array[threadIdx.x * 16 + 10] = c8.z;
    s_array[threadIdx.x * 16 + 11] = c8.w;

    s_array[threadIdx.x * 16 + 12] = c12.x;
    s_array[threadIdx.x * 16 + 13] = c12.y;
    s_array[threadIdx.x * 16 + 14] = c12.z;
    s_array[threadIdx.x * 16 + 15] = c12.w;

    s_array[threadIdx.x * 16 + 0] = c0.x;
    s_array[threadIdx.x * 16 + 1] = c0.y;
    s_array[threadIdx.x * 16 + 2] = c0.z;
    s_array[threadIdx.x * 16 + 3] = c0.w;

    s_array[threadIdx.x * 16 + 4] = c4.x;
    s_array[threadIdx.x * 16 + 5] = c4.y;
    s_array[threadIdx.x * 16 + 6] = c4.z;
    s_array[threadIdx.x * 16 + 7] = c4.w;

} else {

    s_array[threadIdx.x * 16 + 12] = c12.x;
    s_array[threadIdx.x * 16 + 13] = c12.y;
    s_array[threadIdx.x * 16 + 14] = c12.z;
    s_array[threadIdx.x * 16 + 15] = c12.w;

    s_array[threadIdx.x * 16 + 0] = c0.x;
    s_array[threadIdx.x * 16 + 1] = c0.y;
    s_array[threadIdx.x * 16 + 2] = c0.z;
    s_array[threadIdx.x * 16 + 3] = c0.w;

    s_array[threadIdx.x * 16 + 4] = c4.x;
    s_array[threadIdx.x * 16 + 5] = c4.y;
    s_array[threadIdx.x * 16 + 6] = c4.z;
    s_array[threadIdx.x * 16 + 7] = c4.w;

    s_array[threadIdx.x * 16 + 8] = c8.x;
    s_array[threadIdx.x * 16 + 9] = c8.y;
    s_array[threadIdx.x * 16 + 10] = c8.z;
    s_array[threadIdx.x * 16 + 11] = c8.w;
}

Could anyone come up with a better solution to this? I've already studied the reduction example of the SDK, but I am not sure it is applicable to my problem.

Upvotes: 1

Views: 2407

Answers (3)

Mikhail M

Reputation: 960

To avoid bank conflicts, people often add a spare column to a virtual matrix in shared memory. So you could increase your shared array size by 1/16 and replace the indices with

threadIdx.x * 17 + 0
threadIdx.x * 17 + 1
...
threadIdx.x * 17 + 15

For 1D arrays it can be

s_array[idx + idx / 16] = source[idx];

Upvotes: 1

tera

Reputation: 7255

Granted, the code leads to bank conflicts, but that doesn't mean it is any slower.

On your compute capability 1.3 GPU a shared memory transaction with a 2-way bank conflict takes just two more cycles than one without a bank conflict. In two cycles you can't even execute a single instruction to work around the bank conflict. A 4-way bank conflict uses six more cycles compared to a conflict-free access, which would be just enough to execute a single additional conflict-free shared memory access.

In your case the code quite likely is limited by global memory bandwidth (and latency, which is hundreds of cycles, i.e. two orders of magnitude larger than the 2..6 cycles we are talking about here). So you will probably have plenty of spare cycles available where the SM is just idle waiting for the data from global memory. The bank conflicts can then use these cycles without slowing down your code at all.

It would be much more important to ensure the compiler is merging the four bytewise stores for .x, .y, .z and .w into a single 32-bit access. Look at the compiled code using cuobjdump -sass to see if that is the case. If it isn't, follow Otter's advice to use word transfers instead.
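For reference, the merged stores described above would look roughly like this inside the question's kernel (a sketch only, reusing var from the question's code; it assumes the little-endian byte order that CUDA GPUs use, so the byte layout in shared memory is unchanged):

```cuda
// Reinterpret the byte array as 32-bit words and store each uint4
// component with a single 32-bit shared memory access: 4 word stores
// instead of 16 bytewise ones.
unsigned int *s_words = reinterpret_cast<unsigned int *>(s_array);
s_words[threadIdx.x * 4 + 0] = var.x;
s_words[threadIdx.x * 4 + 1] = var.y;
s_words[threadIdx.x * 4 + 2] = var.z;
s_words[threadIdx.x * 4 + 3] = var.w;
```

The 4-way conflicts between threads 0, 4, 8, 12 (and the other groups) remain, but only four conflicting stores are issued per thread instead of sixteen.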

If you are only reading from d_text and not writing to it from within the kernel, you could also use a texture for it, which would still be slower than the kernel with bank conflicts but might provide other advantages to improve speed overall (e.g. if you can't guarantee proper alignment of the data in global memory).

Your alternative bank-conflict-free code, on the other hand, splits the fast 256-byte global memory transactions into four 64-bit transactions, which are a lot less efficient and will probably exceed the maximum number of memory transactions in flight, so that you incur the full four hundred to several thousand cycles of global memory latency.
To avoid that, you need to first transfer the data into registers using a 256-byte wide read, and then move it from the registers into shared memory in a bank-conflict-free way. Still, just the code for the register-to-shared-memory move will take up many more than the six cycles we were trying to work around.
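One possible scheme for that register-to-shared-memory move (a sketch of the idea, not code from the answer) is to rotate the store order per thread, so that the four threads sharing a bank group write different words in each step:

```cuda
// The four 32-bit words read from global memory sit in registers.
// At step i, thread t stores word j = (i + t / 4) % 4 of its 16-byte
// slot. Threads 0, 4, 8 and 12 (which would otherwise hit the same
// bank) then write four different words, i.e. four different banks,
// in every step, so each step is conflict-free.
unsigned int *s_words = reinterpret_cast<unsigned int *>(s_array);
unsigned int w[4] = { var.x, var.y, var.z, var.w };
for (int i = 0; i < 4; i++) {
    int j = (i + threadIdx.x / 4) % 4;
    s_words[threadIdx.x * 4 + j] = w[j];
}
```

Caveat: the dynamically indexed w[] array may be placed in local memory if the compiler cannot resolve the indices; check the generated code with cuobjdump -sass before trusting this.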

Upvotes: 3

otter

Reputation: 525

I think DWORD copying is faster than per-byte copying anyway. Try this instead of your example:

// each thread copies the same four 32-bit words (the 16 bytes read by
// the question's kernel) into its own slot of shared memory
for(int i = 0; i < 4; i++)
{
    ((int*)s_array)[4 * threadIdx.x + i] = ((int*)d_text)[i];
}

Upvotes: 1
