Reputation: 690
I've been learning Cuda and I am still getting to grips with parallelism. The problem I am having at the moment is implementing a max reduce on an array of values. This is my kernel
__global__ void max_reduce(const float* const d_array,
float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (gid == 0)
*d_max = shared[tid];
}
I have implemented a min reduce using the same method (replacing the max function with the min) which works fine.
To test the kernel, I found the min and max values using a serial for loop. The min and max values always come out the same in the kernel but only the min reduce matches up.
Is there something obvious I'm missing/doing wrong?
Upvotes: 12
Views: 17068
Reputation: 412
Here's one that appears naive but isn't. This won't generalize to other functions like sum()
, but it works great for min()
and max()
.
__device__ const float float_min = -3.402e+38;
__global__ void maxKernel(float* d_data)
{
// compute max over all threads, store max in d_data[0]
int i = threadIdx.x;
__shared__ float max_value;
if (i == 0) max_value = float_min;
float v = d_data[i];
__syncthreads();
while (max_value < v) max_value = v;
__syncthreads();
if (i == 0) d_data[0] = max_value;
}
Yup, that's right, only syncing once after initialization and once before writing the result. Damn the race conditions! Full speed ahead!
Before you tell me it won't work, please give it a try first. I have tested thoroughly and it works every time on a variety of arbitrary kernel sizes. It turns out that the race condition doesn't matter in this case because the while loop resolves it.
It works significantly faster than a conventional reduction. Another surprise is that the average number of passes for a kernel size of 32 is 4. Yup, that's (log(n)-1), which seems counterintuitive. It's because the race condition gives an opportunity for good luck. This bonus comes in addition to removing the overhead of the conventional reduction.
With larger n, there is no way to avoid at least one iteration per warp, but that iteration only involves one compare operation which is usually immediately false across the warp when max_value is on the high end of the distribution. You could modify it to use multiple SM's, but that would greatly increase the total workload and add a communication cost, so not likely to help.
For terseness I've omitted the size and output arguments. Size is simply the number of threads (which could be 137 or whatever you like). Output is returned in d_data[0]
.
I've uploaded the working file here: https://github.com/kenseehart/YAMR
Upvotes: -1
Reputation: 151982
Your main conclusion in your deleted answer was correct: the kernel you have posted doesn't comprehend the fact that at the end of that kernel execution, you have done a good deal of the overall reduction, but the results are not quite complete. The results of each block must be combined (somehow). As pointed out in the comments, there are a few other issues with your code as well. Let's take a look at a modified version of it:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX; // 1
if (gid < elements)
shared[tid] = d_array[gid];
__syncthreads();
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]); // 2
__syncthreads();
}
// what to do now?
// option 1: save block result and launch another kernel
if (tid == 0)
d_max[blockIdx.x] = shared[tid]; // 3
// option 2: use atomics
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
gridDim.x*blockDim.x
is greater than elements
. gid
) is less than elements
, when we add s
to gid
for indexing into the shared memory we can still index outside of the legitimate values copied into shared memory, in the last block. Therefore we need the shared memory initialization indicated in note 1.atomicMax
function for floats, but as indicated in the documentation, you can use atomicCAS
to generate any arbitrary atomic function, and I have provided an example of that in atomicMaxf
which provides an atomic max for float
.But is running 1024 or more atomic functions (one per block) the best way? Probably not.
When launching kernels of threadblocks, we really only need to launch enough threadblocks to keep the machine busy. As a rule of thumb we want at least 4-8 warps operating per SM, and somewhat more is probably a good idea. But there's no particular benefit from a machine utilization standpoint to launch thousands of threadblocks initially. If we pick a number like 8 threadblocks per SM, and we have at most, say, 14-16 SMs in our GPU, this gives us a relatively small number of 8*14 = 112 threadblocks. Let's choose 128 (8*16) for a nice round number. There's nothing magical about this, it's just enough to keep the GPU busy. If we make each of these 128 threadblocks do additional work to solve the whole problem, we can then leverage our use of atomics without (perhaps) paying too much of a penalty for doing so, and avoid multiple kernel launches. So how would this look?:
__device__ float atomicMaxf(float* address, float val)
{
int *address_as_int =(int*)address;
int old = *address_as_int, assumed;
while (val > __int_as_float(old)) {
assumed = old;
old = atomicCAS(address_as_int, assumed,
__float_as_int(val));
}
return __int_as_float(old);
}
__global__ void max_reduce(const float* const d_array, float* d_max,
const size_t elements)
{
extern __shared__ float shared[];
int tid = threadIdx.x;
int gid = (blockDim.x * blockIdx.x) + tid;
shared[tid] = -FLOAT_MAX;
while (gid < elements) {
shared[tid] = max(shared[tid], d_array[gid]);
gid += gridDim.x*blockDim.x;
}
__syncthreads();
gid = (blockDim.x * blockIdx.x) + tid; // 1
for (unsigned int s=blockDim.x/2; s>0; s>>=1)
{
if (tid < s && gid < elements)
shared[tid] = max(shared[tid], shared[tid + s]);
__syncthreads();
}
if (tid == 0)
atomicMaxf(d_max, shared[0]);
}
With this modified kernel, when creating the kernel launch, we are not deciding how many threadblocks to launch based on the overall data size (elements
). Instead we are launching a fixed number of blocks (say, 128, you can modify this number to find out what runs fastest), and letting each threadblock (and thus the entire grid) loop through memory, computing partial max operations on each element in shared memory. Then, in the line marked with comment 1, we must re-set the gid
variable to it's initial value. This is actually unnecessary and the block reduction loop code can be further simplified if we guarantee that the size of the grid (gridDim.x*blockDim.x
) is less than elements
, which is not difficult to do at kernel launch.
Note that when using this atomic method, it's necessary to initialize the result (*d_max
in this case) to an appropriate value, like -FLOAT_MAX
.
Again, we normally steer people way from atomic usage, but in this case, it's worth considering if we carefully manage it, and it allows us to save the overhead of an additional kernel launch.
For a ninja-level analysis of how to do fast parallel reductions, take a look at Mark Harris' excellent whitepaper which is available with the relevant CUDA sample.
Upvotes: 22