knight666

Reputation: 1619

OpenCL: Outputting an array of variable length

We are working on an assignment for a GPGPU course. We picked an algorithm, implemented it on the CPU and are now converting it to OpenCL.

The algorithm we've chosen loads a model as a set of triangles and rasterizes them to voxels. The voxels are defined as a VBO of point data. We then use a geometry shader to expand each point into a voxel rendered as triangles.

So our OpenCL program needs to take a list of triangles and output a variable list of points.

And outputting a variable length array seems to be a problem.

The solution we found is to atomically increment a counter and use it both as an index into the output array and as the final size of that array. Except... neither of our GPUs supports the extension for atomic operations.
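For reference, the pattern we had in mind looks roughly like this (only a sketch: the kernel name is made up, it assumes each work item emits one output point of three floats, and it relies on atomic_inc, which our devices do not support):

__kernel void VoxelizeAtomic(
    __global float* outPointcloudBuffer,
    __global float* inTriangleBuffer,
    volatile __global uint* inoutCount
)
{
    size_t i0 = get_global_id(0);

    // atomic_inc returns the previous value, so every work item gets a
    // unique slot and inoutCount ends up holding the total point count.
    uint slot = atomic_inc( inoutCount );

    outPointcloudBuffer[ slot * 3 + 0 ] = inTriangleBuffer[ i0 * 3 + 0 ];
    outPointcloudBuffer[ slot * 3 + 1 ] = inTriangleBuffer[ i0 * 3 + 1 ];
    outPointcloudBuffer[ slot * 3 + 2 ] = inTriangleBuffer[ i0 * 3 + 2 ];
}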

This is what we have so far:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_int64_extended_atomics : enable

#define POS1      i0 * 3 + 0
#define POS2      i0 * 3 + 1
#define POS3      i0 * 3 + 2

void WritePosition( __global float* OutBuffer, uint inIndex, __global float* inPosition )
{
    OutBuffer[ inIndex * 3 ] = inPosition[0];
    OutBuffer[ inIndex * 3 + 1] = inPosition[1];
    OutBuffer[ inIndex * 3 + 2] = inPosition[2];
}

__kernel void Voxelize( 
    __global float* outPointcloudBuffer, 
    __global float* inTriangleBuffer, 
    __global uint* inoutIndex
)
{
    size_t i0 = get_global_id(0);
    size_t i1 = get_local_id(0);

    WritePosition( outPointcloudBuffer, inoutIndex[0], &inTriangleBuffer[ i0 ] );

    //atomic_inc( &inoutIndex[0] );
    // Not atomic: every work item races on this read-modify-write.
    inoutIndex[0] = max(inoutIndex[0], i0);
}

And the output of this is very odd. We're testing it on a very small model (12 triangles, 36 positions, 108 floats) and the result we get is either 31, 63 or 95: always one less than a multiple of 16.

How can we get the length of our variable length output array?

Thanks in advance.

Upvotes: 2

Views: 2088

Answers (1)

chrish.

Reputation: 725

I would guess that this is normally tackled as follows:

  • First pass: Calculate the required size of the output array on the GPU using a scan (parallel prefix sum) primitive. Apple's OpenCL samples contain an example implementation of scan.
  • Allocate the required resources on the host side using the result of the scan. Note that the scan result also gives each work item the destination index for its output.
  • Second pass (optional): Compact the array to those elements that need to be considered in the third pass.
  • Third pass: Rerun the algorithm, passing in the destination indices and the allocated output array.

You might want to have a look at NVIDIA's OpenCL marching cubes implementation where all three passes mentioned above are implemented.
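As a rough sketch of how the first and third passes could look in OpenCL C (the kernel names and the one-point-per-triangle assumption are mine, and the host-side scan in between would come from the Apple or NVIDIA samples mentioned above):

// Pass 1: each work item writes how many points it will emit (0 or 1 here).
__kernel void CountPoints(
    __global const float* inTriangleBuffer,
    __global uint* outCounts )
{
    size_t i = get_global_id(0);
    // Placeholder; a real voxelizer would rasterize the triangle here
    // and write the number of voxels it produces.
    outCounts[ i ] = 1;
}

// The host (or a scan kernel) turns outCounts into an exclusive prefix sum,
// e.g. counts {1, 0, 1, 1} -> offsets {0, 1, 1, 2}, total = 3.
// The total is the length of the output array to allocate.

// Pass 3: each work item writes its point at its precomputed offset.
__kernel void ScatterPoints(
    __global const float* inTriangleBuffer,
    __global const uint* inOffsets,
    __global float* outPointcloudBuffer )
{
    size_t i = get_global_id(0);
    uint dst = inOffsets[ i ];
    outPointcloudBuffer[ dst * 3 + 0 ] = inTriangleBuffer[ i * 3 + 0 ];
    outPointcloudBuffer[ dst * 3 + 1 ] = inTriangleBuffer[ i * 3 + 1 ];
    outPointcloudBuffer[ dst * 3 + 2 ] = inTriangleBuffer[ i * 3 + 2 ];
}

The total produced by the scan tells the host exactly how long the output buffer needs to be, and no atomic operations are required at any point.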

Best, Christoph

Upvotes: 5
