Simon

Reputation: 553

Basic CUDA load and warp transpose

I want to implement a basic blocked load and warp transpose using CUDA 9.0's shuffle operations. I'm aware of the cub and trove implementations, but I'm restricted to compiling with nvrtc, and their standard header includes make these libraries difficult to use in that setting. I'm not looking for anything fancy, just some integer, float and double shuffles on data whose dimension is a power of 2.

Visualising an example with warp size 8, I want to go from:

             correlation
             0    1    2    3

lane 0       0    8   16   24
lane 1       1    9   17   25
lane 2       2   10   18   26
lane 3       3   11   19   27
lane 4       4   12   20   28
lane 5       5   13   21   29
lane 6       6   14   22   30 
lane 7       7   15   23   31 

to this structure:

             correlation
             0    1    2    3

lane 0       0    1    2    3
lane 1       8    9   10   11
lane 2       16  17   18   19
lane 3       24  25   26   27 
lane 4       4    5    6    7
lane 5       12  13   14   15
lane 6       20  21   22   23
lane 7       28  29   30   31 

I feel this should be really simple but I can't figure out what I've done incorrectly. I think that the basic transposition loop should look like:

int loads[ncorrs];
int values[ncorrs];
int lane_id = threadIdx.x & (warp_size - 1);
// 0 0 0 0 4 4 4 4 8 8 8 8 ....
int base_idx = lane_id & (warp_size - ncorrs);
// 0 1 2 3 0 1 2 3 0 1 2 3
int src_corr = lane_id & (ncorrs - 1);

for(int corr=0; corr < ncorrs; ++corr)
{
    int src_lane = base_idx + corr;
    values[corr] = __shfl_sync(mask, loads[src_corr],
                                 src_lane, warp_size);
}

So given the example data above, if we're in lane 5, I expect that the following indexing should occur:

base_idx == 4;
src_corr == 1;

corr == [0, 1, 2, 3]
src_lane == [4, 5, 6, 7]
values == [12, 13, 14, 15]

But instead the following is happening (33's are from later in the data):

             correlation
             0    1    2    3

lane 0       0    0    0    0
lane 1       4    4    4    4
lane 2       12  12   12   12
lane 3       16  16   16   16
lane 4       20  20   20   20
lane 5       24  24   24   24
lane 6       28  28   28   28 
lane 7       33  33   33   33 

What am I doing incorrectly? Full implementation for a warp size of 32:

#include <cstdlib>
#include <cstdio>

#include "cuda.h"

#define ncorr 4
#define warp_size 32

template <int ncorrs>
__global__ void kernel(
    int * input,
    int * output,
    int N)
{
    // This should provide 0 0 0 0 4 4 4 4 8 8 8 8 ...
    #define base_idx(lane_id) (lane_id & (warp_size - ncorrs))
    // This should provide 0 1 2 3 0 1 2 3 0 1 2 3
    #define corr_idx(lane_id) (lane_id & (ncorrs - 1))


    int n = blockIdx.x*blockDim.x + threadIdx.x;
    int lane_id = threadIdx.x & (warp_size - 1);

    if(n >= N)
        { return; }

    // Input correlation handled by this thread
    int src_corr = corr_idx(lane_id);
    int mask = __activemask();

    if(threadIdx.x == 0)
        { printf("mask %d\n", mask); }

    int loads[ncorrs];
    int values[ncorrs];

    #pragma unroll (ncorrs)
    for(int corr=0; corr < ncorrs; ++corr)
        { loads[corr] = input[n + corr*N]; }

    __syncthreads();

    printf("[%d, %d] %d %d %d %d\n",
           lane_id, base_idx(lane_id),
           loads[0], loads[1],
           loads[2], loads[3]);

    #pragma unroll (ncorrs)
    for(int corr=0; corr < ncorrs; ++corr)
    {
        int src_lane = base_idx(lane_id) + corr;
        values[corr] = __shfl_sync(mask, loads[src_corr],
                                     src_lane, warp_size);
    }

    printf("[%d, %d] %d %d %d %d\n",
           lane_id, base_idx(lane_id),
           values[0], values[1],
           values[2], values[3]);


    #pragma unroll (ncorrs)
    for(int corr=0; corr < ncorrs; ++corr)
        { output[n + corr*N] = values[corr]; }
}

void print_data(int * data, int N)
{
    for(int n=0; n < N; ++n)
    {
        printf("% -3d: ", n);
        for(int c=0; c < ncorr; ++c)
        {
            printf("%d ", data[n*ncorr + c]);
        }
        printf("\n");
    }
}

int main(void)
{
    int * host_input;
    int * host_output;

    int * device_input;
    int * device_output;
    int N = 32;

    host_input = (int *) malloc(sizeof(int)*N*ncorr);
    host_output = (int *) malloc(sizeof(int)*N*ncorr);

    printf("malloc done\n");

    cudaMalloc((void **) &device_input, sizeof(int)*N*ncorr);
    cudaMalloc((void **) &device_output, sizeof(int)*N*ncorr);

    printf("cudaMalloc done\n");

    for(int i=0; i < N*ncorr; ++i)
        { host_input[i] = i; }

    print_data(host_input, N);

    dim3 block(256, 1, 1);
    dim3 grid((block.x + N - 1) / N, 1, 1);

    cudaMemcpy(device_input, host_input,
               sizeof(int)*N*ncorr, cudaMemcpyHostToDevice);

    printf("memcpy done\n");

    kernel<4> <<<grid, block>>> (device_input, device_output, N);

    cudaMemcpy(host_output, device_output,
               sizeof(int)*N*ncorr, cudaMemcpyDeviceToHost);

    print_data(host_output, N);

    cudaFree(device_input);
    cudaFree(device_output);

    free(host_input);
    free(host_output);
}

Edit 1: Clarified that the visual example has a warp size of 8 while the full code caters for a warp size of 32

Upvotes: 0

Views: 635

Answers (1)

Robert Crovella

Reputation: 152113

What am I doing incorrectly?

TL;DR: You are transmitting the same input value to multiple output locations. Here is one example, in this line of code:

    values[corr] = __shfl_sync(mask, loads[src_corr],
                                 src_lane, warp_size);

The quantity represented by loads[src_corr] is loop-invariant. Therefore you are transmitting that value to 4 warp lanes (over the 4 loop iterations), which means that value ends up occupying 4 output slots (which is exactly what your printout data shows). That can't be right for a transpose.

Taking a somewhat longer view: I'm not sure I can read your mind, but possibly you are confused about the warp shuffle operation. Perhaps you have assumed that the destination lane can choose which value from the source lane's loads[] array it receives. This is not the case. The destination lane only receives whatever value the source lane provides. Let's take a look at your loop:

// This should provide 0 0 0 0 4 4 4 4 8 8 8 8 ...
#define base_idx(lane_id) (lane_id & (warp_size - ncorrs))
// This should provide 0 1 2 3 0 1 2 3 0 1 2 3
#define corr_idx(lane_id) (lane_id & (ncorrs - 1))


int n = blockIdx.x*blockDim.x + threadIdx.x;
int lane_id = threadIdx.x & (warp_size - 1);

...

// Input correlation handled by this thread
int src_corr = corr_idx(lane_id);
int mask = __activemask();

...

int loads[ncorrs];
int values[ncorrs];

...

#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{
    int src_lane = base_idx(lane_id) + corr;
    values[corr] = __shfl_sync(mask, loads[src_corr], src_lane, warp_size);
}

On the first pass of the above loop, src_lane is going to be 0 for warp lanes 0, 1, 2, and 3. This is evident from the above excerpted code; print it out if you're not sure. That means warp lanes 0-3 are all going to request whatever value is provided by warp lane 0. The value provided by warp lane 0 is loads[src_corr], but src_corr here is whatever value it holds on warp lane 0. Therefore one and only one value will be distributed to warp lanes 0-3. This could not possibly be correct for a transpose; in a correct transpose, no input value shows up in 4 places in the output.
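
As a standalone illustration of that point, here is a minimal sketch of my own (a hypothetical demo kernel, not taken from your code). Every lane contributes exactly one register value per __shfl_sync call, and the srcLane argument only selects whose contribution you receive; the array index in the first argument is evaluated on the calling lane, so it cannot pick a different element on the source lane's side:

#include <cstdio>

__global__ void shfl_demo()
{
    int lane = threadIdx.x & 31;
    int my_vals[2] = {lane, lane + 100};   // each lane's private array

    // Every lane contributes its own my_vals[0] to the shuffle, and every
    // lane asks lane 3 for its contribution. All 32 lanes therefore receive
    // the same number: 3 (lane 3's my_vals[0]).
    int received = __shfl_sync(0xffffffffu, my_vals[0], 3);

    printf("lane %2d received %d\n", lane, received);   // every lane prints 3
}

int main()
{
    shfl_demo<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}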

To fix this, we will need to modify the calculation of both src_lane and src_corr. We will also need to modify the storage location (index) per warp lane at each pass of the loop (I'm calling this new variable dest). We can think of src_lane as defining the lane whose value my thread will receive. We can think of src_corr as defining which of my values I will publish to some other thread on that loop iteration. dest is the location in my values[] array where I will store the currently received value. We can deduce the necessary pattern by carefully studying the relationship between the input values in loads[] and the desired output locations in values[], taking into account the appropriate warp lanes for source and destination. On the first pass of the loop, we desire this pattern:

warp lane: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 ... 
src_lane:  0  8 16 24  1  9 17 25  2 10 18 26  3 11 19 27  4 ... (where my data comes from)
src_corr:  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  2 ... (which value I am transmitting)
dest:      0  1  2  3  0  1  2  3  0  1  2  3  0  1  2  3  0 ... (where I store the received value)

On the second pass of the loop, we desire this pattern:

warp lane: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 ... 
src_lane:  8 16 24  0  9 17 25  1 10 18 26  2 11 19 27  3 12 ... (where my data comes from)
src_corr:  3  3  3  3  3  3  3  3  0  0  0  0  0  0  0  0  1 ... (which value I am transmitting)
dest:      1  2  3  0  1  2  3  0  1  2  3  0  1  2  3  0  1 ... (where I store the received value)

with corresponding changes for the 3rd and 4th pass of the loop. If we realize those patterns in code for your shuffle loop, it could look something like this:

$ cat t352.cu
#include <cstdlib>
#include <cstdio>

#include <assert.h>
#define ncorr 4
#define warp_size 32

template <int ncorrs>
__global__ void kernel(
    int * input,
    int * output,
    int N)
{
    // This should provide 0 0 0 0 4 4 4 4 8 8 8 8 ...
    #define base_idx(lane_id) (lane_id & (warp_size - ncorrs))
    // This should provide 0 1 2 3 0 1 2 3 0 1 2 3
    #define corr_idx(lane_id) (lane_id & (ncorrs - 1))


    int n = blockIdx.x*blockDim.x + threadIdx.x;
    int lane_id = threadIdx.x & (warp_size - 1);

    if(n >= N)
        { return; }

    // Input correlation handled by this thread
    int mask = __activemask();

    if(threadIdx.x == 0)
        { printf("mask %d\n", mask); }

    int loads[ncorrs];
    int values[ncorrs];

    #pragma unroll (ncorrs)
    for(int corr=0; corr < ncorrs; ++corr)
        { loads[corr] = input[n + corr*N]; }

    __syncthreads();

    printf("[%d, %d] %d %d %d %d\n",
           lane_id, base_idx(lane_id),
           loads[0], loads[1],
           loads[2], loads[3]);
    #pragma unroll (ncorrs)
    for(int corr=0; corr < ncorrs; ++corr)
    {
        int src_lane = ((lane_id+corr)%ncorrs)*(warp_size/ncorrs) + (lane_id/ncorrs);
        int src_corr = ((ncorrs-corr)+(lane_id/(warp_size/ncorrs)))%ncorrs;
        int dest = (lane_id+corr)%ncorrs;
        values[dest] = __shfl_sync(mask, loads[src_corr],
                                     src_lane, warp_size);
    }

    printf("[%d, %d] %d %d %d %d\n",
           lane_id, base_idx(lane_id),
           values[0], values[1],
           values[2], values[3]);


    #pragma unroll (ncorrs)
    for(int corr=0; corr < ncorrs; ++corr)
        { output[n + corr*N] = values[corr]; }
}

void print_data(int * data, int N)
{
    for(int n=0; n < N; ++n)
    {
        printf("% -3d: ", n);
        for(int c=0; c < ncorr; ++c)
        {
            printf("%d ", data[n*ncorr + c]);
        }
        printf("\n");
    }
}

int main(void)
{
    int * host_input;
    int * host_output;

    int * device_input;
    int * device_output;
    int N = 32;

    host_input = (int *) malloc(sizeof(int)*N*ncorr);
    host_output = (int *) malloc(sizeof(int)*N*ncorr);

    printf("malloc done\n");

    cudaMalloc((void **) &device_input, sizeof(int)*N*ncorr);
    cudaMalloc((void **) &device_output, sizeof(int)*N*ncorr);

    printf("cudaMalloc done\n");

    for(int i=0; i < N*ncorr; ++i)
        { host_input[i] = i; }

    print_data(host_input, N);

    dim3 block(256, 1, 1);
    dim3 grid((block.x + N - 1) / N, 1, 1);

    cudaMemcpy(device_input, host_input,
               sizeof(int)*N*ncorr, cudaMemcpyHostToDevice);

    printf("memcpy done\n");

    kernel<4> <<<grid, block>>> (device_input, device_output, N);

    cudaMemcpy(host_output, device_output,
               sizeof(int)*N*ncorr, cudaMemcpyDeviceToHost);

    print_data(host_output, N);
    cudaFree(device_input);
    cudaFree(device_output);

    free(host_input);
    free(host_output);
}
$ nvcc -o t352 t352.cu
$ cuda-memcheck ./t352
========= CUDA-MEMCHECK
malloc done
cudaMalloc done
 0 : 0 1 2 3
 1 : 4 5 6 7
 2 : 8 9 10 11
 3 : 12 13 14 15
 4 : 16 17 18 19
 5 : 20 21 22 23
 6 : 24 25 26 27
 7 : 28 29 30 31
 8 : 32 33 34 35
 9 : 36 37 38 39
 10: 40 41 42 43
 11: 44 45 46 47
 12: 48 49 50 51
 13: 52 53 54 55
 14: 56 57 58 59
 15: 60 61 62 63
 16: 64 65 66 67
 17: 68 69 70 71
 18: 72 73 74 75
 19: 76 77 78 79
 20: 80 81 82 83
 21: 84 85 86 87
 22: 88 89 90 91
 23: 92 93 94 95
 24: 96 97 98 99
 25: 100 101 102 103
 26: 104 105 106 107
 27: 108 109 110 111
 28: 112 113 114 115
 29: 116 117 118 119
 30: 120 121 122 123
 31: 124 125 126 127
memcpy done
mask -1
[0, 0] 0 32 64 96
[1, 0] 1 33 65 97
[2, 0] 2 34 66 98
[3, 0] 3 35 67 99
[4, 4] 4 36 68 100
[5, 4] 5 37 69 101
[6, 4] 6 38 70 102
[7, 4] 7 39 71 103
[8, 8] 8 40 72 104
[9, 8] 9 41 73 105
[10, 8] 10 42 74 106
[11, 8] 11 43 75 107
[12, 12] 12 44 76 108
[13, 12] 13 45 77 109
[14, 12] 14 46 78 110
[15, 12] 15 47 79 111
[16, 16] 16 48 80 112
[17, 16] 17 49 81 113
[18, 16] 18 50 82 114
[19, 16] 19 51 83 115
[20, 20] 20 52 84 116
[21, 20] 21 53 85 117
[22, 20] 22 54 86 118
[23, 20] 23 55 87 119
[24, 24] 24 56 88 120
[25, 24] 25 57 89 121
[26, 24] 26 58 90 122
[27, 24] 27 59 91 123
[28, 28] 28 60 92 124
[29, 28] 29 61 93 125
[30, 28] 30 62 94 126
[31, 28] 31 63 95 127
[0, 0] 0 8 16 24
[1, 0] 32 40 48 56
[2, 0] 64 72 80 88
[3, 0] 96 104 112 120
[4, 4] 1 9 17 25
[5, 4] 33 41 49 57
[6, 4] 65 73 81 89
[7, 4] 97 105 113 121
[8, 8] 2 10 18 26
[9, 8] 34 42 50 58
[10, 8] 66 74 82 90
[11, 8] 98 106 114 122
[12, 12] 3 11 19 27
[13, 12] 35 43 51 59
[14, 12] 67 75 83 91
[15, 12] 99 107 115 123
[16, 16] 4 12 20 28
[17, 16] 36 44 52 60
[18, 16] 68 76 84 92
[19, 16] 100 108 116 124
[20, 20] 5 13 21 29
[21, 20] 37 45 53 61
[22, 20] 69 77 85 93
[23, 20] 101 109 117 125
[24, 24] 6 14 22 30
[25, 24] 38 46 54 62
[26, 24] 70 78 86 94
[27, 24] 102 110 118 126
[28, 28] 7 15 23 31
[29, 28] 39 47 55 63
[30, 28] 71 79 87 95
[31, 28] 103 111 119 127
 0 : 0 32 64 96
 1 : 1 33 65 97
 2 : 2 34 66 98
 3 : 3 35 67 99
 4 : 4 36 68 100
 5 : 5 37 69 101
 6 : 6 38 70 102
 7 : 7 39 71 103
 8 : 8 40 72 104
 9 : 9 41 73 105
 10: 10 42 74 106
 11: 11 43 75 107
 12: 12 44 76 108
 13: 13 45 77 109
 14: 14 46 78 110
 15: 15 47 79 111
 16: 16 48 80 112
 17: 17 49 81 113
 18: 18 50 82 114
 19: 19 51 83 115
 20: 20 52 84 116
 21: 21 53 85 117
 22: 22 54 86 118
 23: 23 55 87 119
 24: 24 56 88 120
 25: 25 57 89 121
 26: 26 58 90 122
 27: 27 59 91 123
 28: 28 60 92 124
 29: 29 61 93 125
 30: 30 62 94 126
 31: 31 63 95 127
========= ERROR SUMMARY: 0 errors
$
  1. I believe the above code fairly clearly demonstrates a 32x4 -> 4x32 transpose. I think it is "closest" to the code you presented. It does not do the set of 4x8 transposes you depicted in your diagrams.

  2. I acknowledge that the calculations of src_corr, src_lane, and dest are not completely optimized. But they generate the correct indexing. I assume you can work out how to optimally generate those from the patterns you already have.

  3. I think it's entirely possible the above code has bugs for other dimensions. I've not tried it on anything except the 32x4 case. Nevertheless I think I have indicated what is fundamentally wrong with your code, and demonstrated a pathway to get to proper indexing.

  4. A square matrix transpose up to 32x32 can be done at the warp level using a simpler method; a minimal sketch of that idea is included after this list.
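
A minimal sketch of one such square warp-level transpose, offered purely as my own illustration of the idea (the helper name warp_transpose32 and the rotation-based indexing are assumptions, not necessarily the simpler method referred to above). Each lane holds one 32-element row in registers; on round i each lane publishes row[(lane - i) % 32], reads from lane (lane + i) % 32, and stores what it receives into out[(lane + i) % 32]:

#include <cstdio>

// Hypothetical helper: transpose a 32x32 tile held one row per lane.
// On entry lane l holds row[k] == element (l, k); on exit it holds
// row[k] == element (k, l).
__device__ void warp_transpose32(int (&row)[32])
{
    int lane = threadIdx.x & 31;
    int out[32];
    #pragma unroll
    for (int i = 0; i < 32; ++i)
    {
        int src = (lane + i) & 31;          // lane I read from this round
        int pub = (lane - i + 32) & 31;     // element of my row I publish
        // Lane src publishes its row[(src - i) & 31] == its row[lane],
        // i.e. element (src, lane), which is what belongs in my out[src].
        out[src] = __shfl_sync(0xffffffffu, row[pub], src);
    }
    #pragma unroll
    for (int i = 0; i < 32; ++i)
        { row[i] = out[i]; }
}

__global__ void transpose_demo()
{
    int lane = threadIdx.x & 31;
    int row[32];
    for (int k = 0; k < 32; ++k)
        { row[k] = lane * 32 + k; }         // element (lane, k)

    warp_transpose32(row);

    if (lane == 1)
        { printf("lane 1 now holds column 1: %d %d %d ...\n",
                 row[0], row[1], row[2]); } // expect 1 33 65
}

int main()
{
    transpose_demo<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}

Note that the dynamically indexed row[] and out[] arrays will most likely be placed in local memory rather than registers, so this is a correctness sketch rather than an optimized implementation.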

Upvotes: 6
