Reputation: 553
I want to implement a basic blocked load and warp transpose using CUDA 9.0's shuffle operations. I'm aware of the cub and trove implementations, but I'm restricted to compiling with nvrtc, and these libraries' reliance on the standard header includes makes them difficult to use in that environment. I'm not looking for anything fancy: just some integer, float and double shuffles on data whose dimensions are a power of 2.
Visualising an example with warp size 8, I want to go from:
        correlation
         0   1   2   3
lane 0   0   8  16  24
lane 1   1   9  17  25
lane 2   2  10  18  26
lane 3   3  11  19  27
lane 4   4  12  20  28
lane 5   5  13  21  29
lane 6   6  14  22  30
lane 7   7  15  23  31
to this structure:
        correlation
         0   1   2   3
lane 0   0   1   2   3
lane 1   8   9  10  11
lane 2  16  17  18  19
lane 3  24  25  26  27
lane 4   4   5   6   7
lane 5  12  13  14  15
lane 6  20  21  22  23
lane 7  28  29  30  31
I feel this should be really simple, but I can't figure out what I've done incorrectly. I think the basic transposition loop should look like:
int loads[ncorrs];
int values[ncorrs];
int lane_id = threadIdx.x & (warp_size - 1);
// 0 0 0 0 4 4 4 4 8 8 8 8 ....
int base_idx = lane_id & (warp_size - ncorrs);
// 0 1 2 3 0 1 2 3 0 1 2 3
int src_corr = lane_id & (ncorrs - 1);
for(int corr=0; corr < ncorrs; ++corr)
{
int src_lane = base_idx + corr;
values[corr] = __shfl_sync(mask, loads[src_corr],
src_lane, warp_size);
}
So given the example data above, if we're in lane 5, I expect that the following indexing should occur:
base_idx == 4;
src_corr == 1;
corr == [0, 1, 2, 3]
src_lane == [4, 5, 6, 7]
values == [12, 13, 14, 15]
But instead the following is happening (the 33s are values from later in the data):
        correlation
         0   1   2   3
lane 0   0   0   0   0
lane 1   4   4   4   4
lane 2  12  12  12  12
lane 3  16  16  16  16
lane 4  20  20  20  20
lane 5  24  24  24  24
lane 6  28  28  28  28
lane 7  33  33  33  33
What am I doing incorrectly? Full implementation for a warp size of 32:
#include <cstdlib>
#include <cstdio>
#include "cuda.h"
#define ncorr 4
#define warp_size 32
template <int ncorrs>
__global__ void kernel(
int * input,
int * output,
int N)
{
// This should provide 0 0 0 0 4 4 4 4 8 8 8 8 ...
#define base_idx(lane_id) (lane_id & (warp_size - ncorrs))
// This should provide 0 1 2 3 0 1 2 3 0 1 2 3
#define corr_idx(lane_id) (lane_id & (ncorrs - 1))
int n = blockIdx.x*blockDim.x + threadIdx.x;
int lane_id = threadIdx.x & (warp_size - 1);
if(n >= N)
{ return; }
// Input correlation handled by this thread
int src_corr = corr_idx(lane_id);
int mask = __activemask();
if(threadIdx.x == 0)
{ printf("mask %d\n", mask); }
int loads[ncorrs];
int values[ncorrs];
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{ loads[corr] = input[n + corr*N]; }
__syncthreads();
printf("[%d, %d] %d %d %d %d\n",
lane_id, base_idx(lane_id),
loads[0], loads[1],
loads[2], loads[3]);
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{
int src_lane = base_idx(lane_id) + corr;
values[corr] = __shfl_sync(mask, loads[src_corr],
src_lane, warp_size);
}
printf("[%d, %d] %d %d %d %d\n",
lane_id, base_idx(lane_id),
values[0], values[1],
values[2], values[3]);
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{ output[n + corr*N] = values[corr]; }
}
void print_data(int * data, int N)
{
for(int n=0; n < N; ++n)
{
printf("% -3d: ", n);
for(int c=0; c < ncorr; ++c)
{
printf("%d ", data[n*ncorr + c]);
}
printf("\n");
}
}
int main(void)
{
int * host_input;
int * host_output;
int * device_input;
int * device_output;
int N = 32;
host_input = (int *) malloc(sizeof(int)*N*ncorr);
host_output = (int *) malloc(sizeof(int)*N*ncorr);
printf("malloc done\n");
cudaMalloc((void **) &device_input, sizeof(int)*N*ncorr);
cudaMalloc((void **) &device_output, sizeof(int)*N*ncorr);
printf("cudaMalloc done\n");
for(int i=0; i < N*ncorr; ++i)
{ host_input[i] = i; }
print_data(host_input, N);
dim3 block(256, 1, 1);
dim3 grid((block.x + N - 1) / N, 1, 1);
cudaMemcpy(device_input, host_input,
sizeof(int)*N*ncorr, cudaMemcpyHostToDevice);
printf("memcpy done\n");
kernel<4> <<<grid, block>>> (device_input, device_output, N);
cudaMemcpy(host_output, device_output,
sizeof(int)*N*ncorr, cudaMemcpyDeviceToHost);
print_data(host_output, N);
cudaFree(device_input);
cudaFree(device_output);
free(host_input);
free(host_output);
}
Edit 1: Clarified that the visual example has a warp size of 8 while the full code caters for a warp size of 32
Upvotes: 0
Views: 635
Reputation: 152113
What am I doing incorrectly?
TL;DR: You are broadcasting the same input value to multiple output locations. Here is one example, in this line of code:
values[corr] = __shfl_sync(mask, loads[src_corr],
src_lane, warp_size);
The quantity represented by loads[src_corr] is loop-invariant. Therefore you are transmitting that value to 4 warp lanes (over the 4 loop iterations), which means that single value occupies 4 output locations (which is exactly what your printout data shows). That can't be right for a transpose.
Taking a somewhat longer view, with another example from your code:
I'm not sure I can read your mind, but possibly you are confused about the warp shuffle operation. Perhaps you have assumed that the destination lane can choose which value from the source lane's loads[] array is desired. This is not the case: the destination lane only receives whatever value the source lane provides. Let's take a look at your loop:
// This should provide 0 0 0 0 4 4 4 4 8 8 8 8 ...
#define base_idx(lane_id) (lane_id & (warp_size - ncorrs))
// This should provide 0 1 2 3 0 1 2 3 0 1 2 3
#define corr_idx(lane_id) (lane_id & (ncorrs - 1))
int n = blockIdx.x*blockDim.x + threadIdx.x;
int lane_id = threadIdx.x & (warp_size - 1);
...
// Input correlation handled by this thread
int src_corr = corr_idx(lane_id);
int mask = __activemask();
...
int loads[ncorrs];
int values[ncorrs];
...
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{
int src_lane = base_idx(lane_id) + corr;
values[corr] = __shfl_sync(mask, loads[src_corr], src_lane, warp_size);
}
On the first pass of the above loop, src_lane for warp lanes 0, 1, 2, and 3 will all be 0. This is evident from the excerpted code above, or print it out if you're not sure. That means warp lanes 0-3 will all request whatever value is provided by warp lane 0. The value provided by warp lane 0 is loads[src_corr], but the interpretation of src_corr here is whatever value it has in warp lane 0. Therefore one and only one value is distributed to warp lanes 0-3. This could not possibly be correct for a transpose; no input value shows up in 4 places in the output.
To fix this, we need to modify the calculation of both src_lane and src_corr. We also need to modify the storage location (index) per warp lane on each pass of the loop (I'm calling this new variable dest). Think of src_lane as defining which lane my thread will receive from, src_corr as defining which of my values I will publish to some other thread on that loop iteration, and dest as the location in my values[] array where I will store the currently received value. We can deduce the necessary pattern by carefully studying the relationship between the input value in loads[] and the desired output location in values[], taking into account the appropriate warp lanes for source and destination. On the first pass of the loop, we desire this pattern:
warp lane: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
src_lane: 0 8 16 24 1 9 17 25 2 10 18 26 3 11 19 27 4 ... (where my data comes from)
src_corr: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 2 ... (which value I am transmitting)
dest: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 ... (where I store the received value)
On the second pass of the loop, we desire this pattern:
warp lane: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
src_lane: 8 16 24 0 9 17 25 1 10 18 26 2 11 19 27 3 12 ... (where my data comes from)
src_corr: 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 1 ... (which value I am transmitting)
dest: 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 ... (where I store the received value)
with corresponding changes for the 3rd and 4th pass of the loop. If we realize those patterns in code for your shuffle loop, it could look something like this:
$ cat t352.cu
#include <cstdlib>
#include <cstdio>
#include <assert.h>
#define ncorr 4
#define warp_size 32
template <int ncorrs>
__global__ void kernel(
int * input,
int * output,
int N)
{
// This should provide 0 0 0 0 4 4 4 4 8 8 8 8 ...
#define base_idx(lane_id) (lane_id & (warp_size - ncorrs))
// This should provide 0 1 2 3 0 1 2 3 0 1 2 3
#define corr_idx(lane_id) (lane_id & (ncorrs - 1))
int n = blockIdx.x*blockDim.x + threadIdx.x;
int lane_id = threadIdx.x & (warp_size - 1);
if(n >= N)
{ return; }
// Input correlation handled by this thread
int mask = __activemask();
if(threadIdx.x == 0)
{ printf("mask %d\n", mask); }
int loads[ncorrs];
int values[ncorrs];
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{ loads[corr] = input[n + corr*N]; }
__syncthreads();
printf("[%d, %d] %d %d %d %d\n",
lane_id, base_idx(lane_id),
loads[0], loads[1],
loads[2], loads[3]);
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{
int src_lane = ((lane_id+corr)%ncorrs)*(warp_size/ncorrs) + (lane_id/ncorrs);
int src_corr = ((ncorrs-corr)+(lane_id/(warp_size/ncorrs)))%ncorrs;
int dest = (lane_id+corr)%ncorrs;
values[dest] = __shfl_sync(mask, loads[src_corr],
src_lane, warp_size);
}
printf("[%d, %d] %d %d %d %d\n",
lane_id, base_idx(lane_id),
values[0], values[1],
values[2], values[3]);
#pragma unroll (ncorrs)
for(int corr=0; corr < ncorrs; ++corr)
{ output[n + corr*N] = values[corr]; }
}
void print_data(int * data, int N)
{
for(int n=0; n < N; ++n)
{
printf("% -3d: ", n);
for(int c=0; c < ncorr; ++c)
{
printf("%d ", data[n*ncorr + c]);
}
printf("\n");
}
}
int main(void)
{
int * host_input;
int * host_output;
int * device_input;
int * device_output;
int N = 32;
host_input = (int *) malloc(sizeof(int)*N*ncorr);
host_output = (int *) malloc(sizeof(int)*N*ncorr);
printf("malloc done\n");
cudaMalloc((void **) &device_input, sizeof(int)*N*ncorr);
cudaMalloc((void **) &device_output, sizeof(int)*N*ncorr);
printf("cudaMalloc done\n");
for(int i=0; i < N*ncorr; ++i)
{ host_input[i] = i; }
print_data(host_input, N);
dim3 block(256, 1, 1);
dim3 grid((block.x + N - 1) / N, 1, 1);
cudaMemcpy(device_input, host_input,
sizeof(int)*N*ncorr, cudaMemcpyHostToDevice);
printf("memcpy done\n");
kernel<4> <<<grid, block>>> (device_input, device_output, N);
cudaMemcpy(host_output, device_output,
sizeof(int)*N*ncorr, cudaMemcpyDeviceToHost);
print_data(host_output, N);
cudaFree(device_input);
cudaFree(device_output);
free(host_input);
free(host_output);
}
$ nvcc -o t352 t352.cu
$ cuda-memcheck ./t352
========= CUDA-MEMCHECK
malloc done
cudaMalloc done
0 : 0 1 2 3
1 : 4 5 6 7
2 : 8 9 10 11
3 : 12 13 14 15
4 : 16 17 18 19
5 : 20 21 22 23
6 : 24 25 26 27
7 : 28 29 30 31
8 : 32 33 34 35
9 : 36 37 38 39
10: 40 41 42 43
11: 44 45 46 47
12: 48 49 50 51
13: 52 53 54 55
14: 56 57 58 59
15: 60 61 62 63
16: 64 65 66 67
17: 68 69 70 71
18: 72 73 74 75
19: 76 77 78 79
20: 80 81 82 83
21: 84 85 86 87
22: 88 89 90 91
23: 92 93 94 95
24: 96 97 98 99
25: 100 101 102 103
26: 104 105 106 107
27: 108 109 110 111
28: 112 113 114 115
29: 116 117 118 119
30: 120 121 122 123
31: 124 125 126 127
memcpy done
mask -1
[0, 0] 0 32 64 96
[1, 0] 1 33 65 97
[2, 0] 2 34 66 98
[3, 0] 3 35 67 99
[4, 4] 4 36 68 100
[5, 4] 5 37 69 101
[6, 4] 6 38 70 102
[7, 4] 7 39 71 103
[8, 8] 8 40 72 104
[9, 8] 9 41 73 105
[10, 8] 10 42 74 106
[11, 8] 11 43 75 107
[12, 12] 12 44 76 108
[13, 12] 13 45 77 109
[14, 12] 14 46 78 110
[15, 12] 15 47 79 111
[16, 16] 16 48 80 112
[17, 16] 17 49 81 113
[18, 16] 18 50 82 114
[19, 16] 19 51 83 115
[20, 20] 20 52 84 116
[21, 20] 21 53 85 117
[22, 20] 22 54 86 118
[23, 20] 23 55 87 119
[24, 24] 24 56 88 120
[25, 24] 25 57 89 121
[26, 24] 26 58 90 122
[27, 24] 27 59 91 123
[28, 28] 28 60 92 124
[29, 28] 29 61 93 125
[30, 28] 30 62 94 126
[31, 28] 31 63 95 127
[0, 0] 0 8 16 24
[1, 0] 32 40 48 56
[2, 0] 64 72 80 88
[3, 0] 96 104 112 120
[4, 4] 1 9 17 25
[5, 4] 33 41 49 57
[6, 4] 65 73 81 89
[7, 4] 97 105 113 121
[8, 8] 2 10 18 26
[9, 8] 34 42 50 58
[10, 8] 66 74 82 90
[11, 8] 98 106 114 122
[12, 12] 3 11 19 27
[13, 12] 35 43 51 59
[14, 12] 67 75 83 91
[15, 12] 99 107 115 123
[16, 16] 4 12 20 28
[17, 16] 36 44 52 60
[18, 16] 68 76 84 92
[19, 16] 100 108 116 124
[20, 20] 5 13 21 29
[21, 20] 37 45 53 61
[22, 20] 69 77 85 93
[23, 20] 101 109 117 125
[24, 24] 6 14 22 30
[25, 24] 38 46 54 62
[26, 24] 70 78 86 94
[27, 24] 102 110 118 126
[28, 28] 7 15 23 31
[29, 28] 39 47 55 63
[30, 28] 71 79 87 95
[31, 28] 103 111 119 127
0 : 0 32 64 96
1 : 1 33 65 97
2 : 2 34 66 98
3 : 3 35 67 99
4 : 4 36 68 100
5 : 5 37 69 101
6 : 6 38 70 102
7 : 7 39 71 103
8 : 8 40 72 104
9 : 9 41 73 105
10: 10 42 74 106
11: 11 43 75 107
12: 12 44 76 108
13: 13 45 77 109
14: 14 46 78 110
15: 15 47 79 111
16: 16 48 80 112
17: 17 49 81 113
18: 18 50 82 114
19: 19 51 83 115
20: 20 52 84 116
21: 21 53 85 117
22: 22 54 86 118
23: 23 55 87 119
24: 24 56 88 120
25: 25 57 89 121
26: 26 58 90 122
27: 27 59 91 123
28: 28 60 92 124
29: 29 61 93 125
30: 30 62 94 126
31: 31 63 95 127
========= ERROR SUMMARY: 0 errors
$
I believe the above code fairly clearly demonstrates a 32x4 -> 4x32 transpose. I think it is "closest" to the code you presented. It does not do the set of 4x8 transposes you depicted in your diagrams.
I acknowledge that the calculations of src_corr, src_lane, and dest are not completely optimized, but they generate the correct indexing. I assume you can work out how to generate those optimally from the patterns you already have.
I think it's entirely possible the above code has bugs for other dimensions; I've not tried it on anything except the 32x4 case. Nevertheless, I think I have indicated what is fundamentally wrong with your code and demonstrated a pathway to proper indexing.
Note that a square matrix transpose of up to 32x32 can be done at the warp level using a simpler method.
Upvotes: 6