Jon Bramble

Reputation: 356

Is the following data processing task suitable for GPU computing?

I'm looking to upgrade my graphics card to be able to process the following task in parallel. As I have no experience in GPU computing, would this task be suitable, and is it possible to estimate the rate at which the processing could be done before I buy?
My project is publicly funded but has a limited budget, so I need to make the right choice.

I have an in-house-built camera chip that produces 4 x 256x256 images at 100 fps. The data is accessed by calling a C function, passing a pointer to an array of unsigned short, and I can read the data out fast enough into a memory buffer. Currently the raw data is saved to disk and processed offline later, but for future lab experiments with this camera I want to access data derived from the images as the experiment runs. I have written methods in C++ using valarray to calculate the derived data, but they are too slow on my current hardware at about 40 ms per frame (I have experimented with optimisation and have already cut the time considerably from >100 ms).

If a frame is denoted by S and its four subframes (in time) are S1, S2, S3, S4, I must calculate the following images, as well as their image averages:

(S1 + S2 + S3 + S4) / 4
sqrt((S3 - S1)^2 + (S4 - S2)^2)
arctan((S3 - S1) / (S2 - S4))
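
A minimal sketch of the kind of valarray code this involves (the names are illustrative, and it assumes the subframes have already been converted from unsigned short to std::valarray<float> of 256*256 elements):

#include <valarray>

// Sketch only: S1..S4 are the four subframes of one camera frame.
struct Derived
{
    std::valarray<float> average;    // (S1 + S2 + S3 + S4) / 4
    std::valarray<float> magnitude;  // sqrt((S3 - S1)^2 + (S4 - S2)^2)
    std::valarray<float> phase;      // arctan((S3 - S1) / (S2 - S4))
};

Derived computeDerived(const std::valarray<float>& S1, const std::valarray<float>& S2,
                       const std::valarray<float>& S3, const std::valarray<float>& S4)
{
    const std::valarray<float> s3ms1 = S3 - S1;
    const std::valarray<float> s4ms2 = S4 - S2;
    return Derived{ (S1 + S2 + S3 + S4) / 4.0f,
                    std::sqrt(s3ms1 * s3ms1 + s4ms2 * s4ms2),
                    std::atan(s3ms1 / (S2 - S4)) };
}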

Upvotes: 1

Views: 136

Answers (2)

m.s.

Reputation: 16354

In addition to what @MuertoExcobito already wrote, you must also account for copying the data to and from the GPU; in your case, however, this is not much data.
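
For a rough sense of scale: one frame is 4 x 256 x 256 pixels x 2 bytes, i.e. about 512 KB, so at 100 fps the raw input stream is only around 50 MB/s; host-device transfers over PCIe run at several GB/s, so the copies themselves will not be the bottleneck here.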

I created a simple Thrust-based implementation, which can be compiled and run using CUDA 7 like this:

nvcc -std=c++11 main.cu && ./a.out

Averaged over 10000 runs, one iteration, which includes copying to the GPU, calculating the three result images and copying the results back from the GPU, takes 1.79 ms on my computer (Ubuntu 14.04 x64, Intel CPU, GeForce GTX 680).


The file "helper_math.h" is adapted from the CUDA SDK and can be found here:

https://gist.github.com/dachziegel/70e008dee7e3f0c18656

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <vector_types.h>
#include <iostream>
#include <chrono>

#include "helper_math.h"

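// Holds the four subframes S1..S4 of one camera frame.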
template<typename T>
struct QuadVec
{
  T S1, S2, S3, S4;
  QuadVec(const int N) : S1(N), S2(N), S3(N), S4(N){}
};

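// Holds the three derived images computed from one frame.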
template<typename T>
struct Result
{
  T average, sqrt, arctan;
  Result(const int N) : average(N), sqrt(N), arctan(N){}
};


typedef thrust::tuple<float4,float4,float4,float4> QuadInput;
typedef thrust::tuple<float4,float4,float4> TripleOutput;

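// Per-element functor: computes the average, sqrt and arctan images,
// operating on float4 values (four floats at a time).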
struct CalcResult : public thrust::unary_function<QuadInput,TripleOutput>
{
  __host__ __device__
  TripleOutput operator()(const QuadInput& f) const
  {
      const float4 s3ms1 = thrust::get<2>(f) - thrust::get<0>(f);
      const float4 s4ms2 = thrust::get<3>(f) - thrust::get<1>(f);
      const float4 sqrtArg = s3ms1*s3ms1 + s4ms2*s4ms2;
      const float4 atanArg = s3ms1 / s4ms2;
      return thrust::make_tuple((thrust::get<0>(f) + thrust::get<1>(f) + thrust::get<2>(f) + thrust::get<3>(f)) / 4.0f,
              make_float4(sqrtf(sqrtArg.x), sqrtf(sqrtArg.y), sqrtf(sqrtArg.z), sqrtf(sqrtArg.w)),
              make_float4(atanf(atanArg.x), atanf(atanArg.y), atanf(atanArg.z), atanf(atanArg.w))
              );
  }
};


int main()
{
  typedef thrust::host_vector<float4> HostVec;
  typedef thrust::device_vector<float4> DevVec;

  const int N = 256;

  QuadVec<HostVec> hostFrame(N*N);
  QuadVec<DevVec> devFrame(N*N);

  Result<HostVec> hostResult(N*N);
  Result<DevVec> devResult(N*N);

  const int runs = 10000;
  int accumulatedDuration = 0;
  for (int i = 0; i < runs; ++i)
  {
        auto start = std::chrono::system_clock::now();

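        // copy the four subframes from the host to the device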
        thrust::copy(hostFrame.S1.begin(), hostFrame.S1.end(), devFrame.S1.begin());
        thrust::copy(hostFrame.S2.begin(), hostFrame.S2.end(), devFrame.S2.begin());
        thrust::copy(hostFrame.S3.begin(), hostFrame.S3.end(), devFrame.S3.begin());
        thrust::copy(hostFrame.S4.begin(), hostFrame.S4.end(), devFrame.S4.begin());

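        // compute all three derived images in a single fused transform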
        thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(devFrame.S1.begin(), devFrame.S2.begin(), devFrame.S3.begin(), devFrame.S4.begin())),
              thrust::make_zip_iterator(thrust::make_tuple(devFrame.S1.end(), devFrame.S2.end(), devFrame.S3.end(), devFrame.S4.end())),
              thrust::make_zip_iterator(thrust::make_tuple(devResult.average.begin(), devResult.sqrt.begin(), devResult.arctan.begin())),
              CalcResult() );

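        // copy the three result images back to the host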
        thrust::copy(devResult.average.begin(), devResult.average.end(), hostResult.average.begin());
        thrust::copy(devResult.sqrt.begin(), devResult.sqrt.end(), hostResult.sqrt.begin());
        thrust::copy(devResult.arctan.begin(), devResult.arctan.end(), hostResult.arctan.begin());

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - start);
        accumulatedDuration += duration.count();
  }

  std::cout << accumulatedDuration/runs << std::endl; // average time per iteration in microseconds
  return 0;
}
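
As a side note on getting the camera data in: the raw unsigned short pixels have to be converted to floating point and transferred to the device at some point. A minimal sketch of one way to do that for a single 256x256 subframe (this uses one float per pixel rather than the float4 packing above, and the function name and buffer layout are my assumptions, not part of the question):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

// Hypothetical upload of one 256x256 subframe: the range constructor converts
// each unsigned short pixel to float on the host, and constructing the
// device_vector from the host_vector performs the host-to-device copy.
thrust::device_vector<float> uploadSubframe(const unsigned short* raw)
{
    const int numPixels = 256 * 256;
    thrust::host_vector<float> host(raw, raw + numPixels);
    return thrust::device_vector<float>(host);
}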

Upvotes: 0

MuertoExcobito

Reputation: 10049

This seems like a good fit for an operation to be carried out on a GPU. GPUs are better suited than CPUs to performing massive numbers of relatively simple calculations; they are not as efficient when there is branching logic or there are interdependencies between 'threads'. Although this kind of wanders into 'opinion' territory, I'll try to back up my answer with some numbers.

As a rough estimate of the performance you can expect, I wrote a quick HLSL pixel shader that performs your proposed operations (untested, so no guarantee of functionality!):

Texture2D S[4] : register(t0);
SamplerState mySampler : register(s0);

struct PS_OUT
{
    float4 average : SV_Target0;
    float4 sqrt    : SV_Target1;
    float4 arctan  : SV_Target2;
};

PS_OUT main(float2 UV: TEXCOORD0)
{
    PS_OUT output;
    float4 SSamples[4];
    int i;
    for (i = 0; i < 4; i++)
    {
        SSamples[i] = S[i].Sample(mySampler, UV);
    }
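    // differences reused by both the sqrt and arctan outputs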
    float4 s3ms1 = SSamples[2] - SSamples[0];
    float4 s4ms2 = SSamples[3] - SSamples[1];

    output.average = (SSamples[0] + SSamples[1] + SSamples[2] + SSamples[3]) / 4.0;
    output.sqrt    = sqrt(s3ms1*s3ms1 + s4ms2*s4ms2);
    output.arctan  = atan(s3ms1 / s4ms2);   
    return output;
}

Compiling this (fxc /T ps_4_0 example.ps) reports: Approximately 32 instruction slots used.

If you are processing 256x256 pixels (64k pixels) per frame, that works out to about 2.1 million instructions per frame, or 210 million per second at 100 fps. Looking at a chart of GPU performance (Nvidia, for example: http://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units), all their GPUs past the GeForce 4 (circa 2005) have sufficient speed to achieve this.

Note that this shader performance is only an estimate, the listed rates are theoretical maximums, and I'm only accounting for the pixel-unit work (although it will be doing the majority of the work). However, with any sufficiently recent video card the FLOPS will far exceed your needs, so you should easily be able to do this on the GPU at 100 fps. Assuming you have a PC newer than 2005, you probably already have a video card powerful enough.

Upvotes: 2
