Breakthrough

Reputation: 2534

What is the most efficient method to move video frame data to a GPU?

I'm currently implementing a motion tracking algorithm on my GPU (CUDA/C++), and am seeing very strong speed-ups so far. As one can probably expect, however, the main bottleneck is the actual transferring of frame (image) data from the CPU to the GPU.

As is, I'm using OpenCV to read in a test video file. OpenCV, however, returns the frames as packed bytes in the form RRGGBB RRGGBB ... - in other words, each pixel is aligned to a 24-bit boundary. This prevents me from using coalesced memory accesses, which causes severe performance penalties on the GPU. For now I'm just using some pre-generated test data which is 32-bit aligned (padded with zeros in the form RRGGBB00 RRGGBB00 ...), but I'd like to start using actual video data.

With that in mind, I have two main questions:

  1. I could manually repack the pixels of interest on the CPU and then initiate a transfer, but is there any method that can quickly transfer the pixel data to the GPU already aligned to 32-bit boundaries? (I would assume this carries the same performance hit as the pre-processing, however.)

  2. Is there another library I can use to read in the video in a different format? For example, I know SDL surfaces are aligned to 32-bit boundaries, even without the inclusion of an alpha channel.

The end goal of our implementation would be to interface in real-time with a camera for robotic control, although for now I just want something that can efficiently decode my test video to test our feature detection and motion tracking algorithms with pre-defined test data.
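For reference, the manual CPU-side repacking I mean in question 1 is just a copy loop that widens each 3-byte pixel to 4 bytes before the transfer; a minimal sketch (the function name is hypothetical, and the pad byte is zeroed the same way as in my test data):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Widen tightly packed 24-bit pixels (RRGGBB RRGGBB ...) to 32-bit-aligned
// pixels (RRGGBB00 RRGGBB00 ...) so every pixel starts on a 4-byte boundary.
std::vector<uint8_t> padPixels(const uint8_t *src, size_t pixelCount) {
    std::vector<uint8_t> dst(pixelCount * 4, 0);  // pad bytes start at zero
    for (size_t i = 0; i < pixelCount; ++i) {
        dst[i * 4 + 0] = src[i * 3 + 0];
        dst[i * 4 + 1] = src[i * 3 + 1];
        dst[i * 4 + 2] = src[i * 3 + 2];
        // dst[i * 4 + 3] remains 0 (the padding byte)
    }
    return dst;
}
```

The widened buffer can then go to the device in a single cudaMemcpy (or an equivalent conversion via OpenCV's cvtColor to a 4-channel format); what I'm really asking is whether this extra pass can be avoided entirely.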

Upvotes: 3

Views: 2053

Answers (1)

Eugene

Reputation: 9474

I tried writing a simple CUDA kernel that pads 24-bit values to 32 bits using shared memory. Note that this is not very tidy code (it only works with a single block and relies on unsigned int being 32 bits), so use it with care. I tried both a version with shared-memory atomics (compute capability 2.0 and later) and one without - both seem to work:

__global__ void pad(unsigned int *data, unsigned int *odata) {
    __shared__ unsigned int array[WORK_SIZE];
    unsigned int v, high, low;
    // Input word t contributes its upper part to output word
    // index = t * 4 / 3 and the remainder to output word index + 1.
    const int index = (threadIdx.x * sizeof(unsigned int)) / 3;
    const int shl = threadIdx.x % 3;
    const int shr = 3 - shl;

    array[threadIdx.x] = 0;
    __syncthreads();

    // Only the first ceil(WORK_SIZE * 3 / 4) threads have an input word.
    if (threadIdx.x
            < ((WORK_SIZE * 3) + sizeof(unsigned int) - 1)
                    / sizeof(unsigned int)) {
        v = data[threadIdx.x];
        high = (v >> (shl * 8)) & ~0xFF;  // part belonging to array[index]
        low = v << (shr * 8);             // part spilling into array[index + 1]
#if __CUDA_ARCH__ < 200
        // Pre-Fermi: no shared-memory atomics, so write the two halves
        // in separate passes with a barrier in between.
        array[index] = high;
    }
    __syncthreads();
    if (threadIdx.x
            < ((WORK_SIZE * 3) + sizeof(unsigned int) - 1)
                    / sizeof(unsigned int)) {
        array[index + 1] += low;
#else
        // Fermi and later: OR both halves in with shared-memory atomics.
        if (high)
            atomicOr(array + index, high);
        if (low)
            atomicOr(array + 1 + index, low);
#endif
    }
    __syncthreads();

    // Do computations! (Here, just set the pad byte.)
    odata[threadIdx.x] = array[threadIdx.x] + 0xFF;
}
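To sanity-check the kernel's shift arithmetic without a device, the same redistribution can be rerun on the host in plain C++, treating each 32-bit word purely as an integer (the in-memory byte order then depends on endianness, exactly as it does for the kernel's loads). The function name and the handling of the trailing spill slot are my own additions:

```cpp
#include <cstdint>
#include <vector>

// Host-side rerun of the kernel's integer arithmetic: input word t
// contributes its upper bytes to output word t * 4 / 3 and the rest to
// the following word, mirroring the kernel's shared-memory updates.
// Expects in.size() == ceil(workSize * 3 / 4.0), like the kernel.
std::vector<uint32_t> padSim(const std::vector<uint32_t> &in,
                             unsigned workSize) {
    std::vector<uint32_t> out(workSize + 1, 0);  // +1: last `low` may spill
    for (unsigned t = 0; t < in.size(); ++t) {
        const unsigned index = t * sizeof(uint32_t) / 3;
        const unsigned shl = t % 3;
        const unsigned shr = 3 - shl;
        const uint32_t v = in[t];
        const uint32_t high = (v >> (shl * 8)) & ~0xFFu;
        const uint32_t low = v << (shr * 8);
        out[index] |= high;      // the two halves never share bits,
        out[index + 1] |= low;   // so |= matches the kernel's += path
    }
    out.resize(workSize);        // drop the spill slot
    for (uint32_t &w : out)
        w += 0xFF;               // same final step as the kernel
    return out;
}
```

With WORK_SIZE = 4, three packed input words yield four output words, each carrying one 24-bit value shifted up by a byte with the kernel's 0xFF in the low byte.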

Upvotes: 2
