Ivan

Reputation: 7746

Clarification on the flow of realtime work to a GPU

I just started learning CUDA, and I am confused by one point. For the sake of argument, imagine that I had several hundred buoys in the ocean. Imagine that each broadcasts a std::vector of readings intermittently, once every few milliseconds. The vector might be 5 readings, or 10 readings, etc., depending on the conditions in the ocean at that time. There is no way to tell when an event will fire; it is not deterministic.

Imagine that I had the idea that I could predict the temperature by gathering all this information in real time, but that the predictor first had to sort all std::vectors on temperature across all buoys. My question is this: do I have to copy the entire data set to the GPU every time a single buoy fires an event? Since the other buoys' data has not changed, can I leave that data on the GPU, just update what has changed, and ask the kernel to rerun the prediction?

If yes, what is the [thrust pseudo]-code that would do this? Is this best done with streams, events, and pinned memory? What is the limit on how fast I can update the GPU with realtime data?

I was told that this sort of problem is not well suited to a GPU and is better handled by an FPGA.

Upvotes: 1

Views: 150

Answers (1)

Robert Crovella

Reputation: 151849

A basic sequence could look like this.

Setup phase (initial sort):

  1. Gather an initial set of vectors from each buoy.
  2. Create a parallel set of vectors, one for each buoy, of length equal to the initial length of the buoy vector, and populated by the buoy index:

    b1:  1.5 1.7 2.2 2.3 2.6
    i1:    1   1   1   1   1
    b2:  2.4 2.5 2.6
    i2:    2   2   2
    b3:  2.8
    i3:    3
    
  3. Concatenate all vectors into a single buoy-temp-vector and buoy-index-vector:

    b:  1.5 1.7 2.2 2.3 2.6 2.4 2.5 2.6 2.8
    i:    1   1   1   1   1   2   2   2   3
    
  4. Sort-by-key:

    b:  1.5 1.7 2.2 2.3 2.4 2.5 2.6 2.6 2.8
    i:    1   1   1   1   2   2   1   2   3
    

The setup phase is complete. The update phase is executed whenever a buoy update is received. Suppose buoy 2 sends an update:

    b2:  2.5 2.7 2.9 3.0
  1. Do thrust::remove_if on the buoy-value vector, removing each element whose corresponding index-vector position holds the updated buoy number (2 in this case). Repeat the remove_if on the index vector itself using the same rule:

    b:  1.5 1.7 2.2 2.3 2.6 2.8
    i:    1   1   1   1   1   3
    
  2. Generate the corresponding index vector for the buoy to be updated, and copy both vectors (buoy 2 temp-value and index vectors) to the device:

    b2: 2.5 2.7 2.9 3.0
    i2:   2   2   2   2
    
  3. Do thrust::merge_by_key on the newly received update from buoy 2:

    b: 1.5 1.7 2.2 2.3 2.5 2.6 2.7 2.8 2.9 3.0
    i:   1   1   1   1   2   1   2   3   2   2
    

The only data that has to be copied to the device on an update cycle is the actual buoy data being updated. Note that with some work the setup phase could be eliminated, and the initial assembly of the vectors could merely be treated as "updates" from each buoy into initially-empty buoy value and buoy index vectors; but for description purposes, I think it's easier to visualize with a setup phase.

The above description doesn't explicitly point out the various vector sizings and resizings needed, but these can be accomplished using the same methods one would use on std::vector. Vector resizing may be "costly" on the GPU, just as it can be "costly" on the CPU (if a resize to a larger size triggers a new allocation and copy). This cost could also be eliminated if a max number of buoys and a max number of elements per update are known; in that case, we could allocate the overall buoy value and buoy index vectors to their maximum necessary sizes.

Here is a fully-worked example following the above outline. As a placeholder, I have included a dummy prediction_kernel call, showing where you could insert your specialized prediction code, operating on the sorted data.

#include <stdio.h>
#include <stdlib.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/merge.h>
#include <thrust/remove.h>
#include <thrust/copy.h>

#include <sys/time.h>
#include <time.h>

#define N_BUOYS 1024
#define N_MAX_UPDATE 1024
#define T_RANGE 100
#define N_UPDATES_TEST 1000

struct equal_func{

  const int idx;

  equal_func(int _idx) : idx(_idx) {}

  __host__ __device__
  bool operator()(int test_val) {
    return (test_val == idx);
  }
};

__device__ float dev_result[N_UPDATES_TEST];

// dummy "prediction" kernel
__global__ void prediction_kernel(const float *data, int iter, size_t d_size){
    int idx=threadIdx.x+blockDim.x*blockIdx.x;
    if (idx == 0) dev_result[iter] = data[d_size/2];
  }

void create_vec(unsigned int id, thrust::host_vector<float> &data, thrust::host_vector<int> &idx){
  size_t mysize = rand()%N_MAX_UPDATE;
  data.resize(mysize);
  idx.resize(mysize);
  for (int i = 0; i < mysize; i++){
    data[i] = ((float)rand()/(float)RAND_MAX)*(float)T_RANGE;
    idx[i] = id;}
  thrust::sort(data.begin(), data.end());
}

int main(){

  timeval t1, t2;
  int pp = 0;
// ping-pong processing vectors
  thrust::device_vector<float> buoy_data[2];
  buoy_data[0].resize(N_BUOYS*N_MAX_UPDATE);
  buoy_data[1].resize(N_BUOYS*N_MAX_UPDATE);
  thrust::device_vector<int>  buoy_idx[2];
  buoy_idx[0].resize(N_BUOYS*N_MAX_UPDATE);
  buoy_idx[1].resize(N_BUOYS*N_MAX_UPDATE);

// vectors for initial buoy data
  thrust::host_vector<float> h_buoy_data[N_BUOYS];
  thrust::host_vector<int> h_buoy_idx[N_BUOYS];

//SETUP
 // populate initial data
  int lidx=0;
  for (int i = 0; i < N_BUOYS; i++){
    create_vec(i, h_buoy_data[i], h_buoy_idx[i]);
    thrust::copy(h_buoy_data[i].begin(), h_buoy_data[i].end(), &(buoy_data[pp][lidx]));
    thrust::copy(h_buoy_idx[i].begin(), h_buoy_idx[i].end(), &(buoy_idx[pp][lidx]));
    lidx+= h_buoy_data[i].size();}
 // sort initial data
  thrust::sort_by_key(&(buoy_data[pp][0]), &(buoy_data[pp][lidx]), &(buoy_idx[pp][0]));


//UPDATE CYCLE
  gettimeofday(&t1, NULL);
  for (int i = 0; i < N_UPDATES_TEST; i++){
    unsigned int vec_to_update = rand()%N_BUOYS;
    int nidx = lidx - h_buoy_data[vec_to_update].size();
    // generate a new (sorted) update for the chosen buoy
    create_vec(vec_to_update, h_buoy_data[vec_to_update], h_buoy_idx[vec_to_update]);
    // remove the stale entries for this buoy from both vectors
    thrust::remove_if(&(buoy_data[pp][0]), &(buoy_data[pp][lidx]), buoy_idx[pp].begin(), equal_func(vec_to_update));
    thrust::remove_if(&(buoy_idx[pp][0]), &(buoy_idx[pp][lidx]), equal_func(vec_to_update));
    lidx = nidx + h_buoy_data[vec_to_update].size();
    // copy only the updated buoy's data (values and indices) to the device
    thrust::device_vector<float> temp_data = h_buoy_data[vec_to_update];
    thrust::device_vector<int> temp_idx = h_buoy_idx[vec_to_update];
    int ppn = (pp == 0)?1:0;
    // merge the update into the other ping-pong buffer, keeping the data sorted by temperature
    thrust::merge_by_key(&(buoy_data[pp][0]), &(buoy_data[pp][nidx]), temp_data.begin(), temp_data.end(), buoy_idx[pp].begin(), temp_idx.begin(), buoy_data[ppn].begin(), buoy_idx[ppn].begin() );
    pp = ppn; // update ping-pong buffer index
    prediction_kernel<<<1,1>>>(thrust::raw_pointer_cast(buoy_data[pp].data()), i, lidx);
  }
  gettimeofday(&t2, NULL);
  unsigned int tdiff_us = ((t2.tv_sec*1000000)+t2.tv_usec) - ((t1.tv_sec*1000000)+t1.tv_usec);
  printf("Completed %d updates in %f sec\n", N_UPDATES_TEST, (float)tdiff_us/(float)1000000);
//  float *temps = (float *)malloc(N_UPDATES_TEST*sizeof(float));
//  cudaMemcpyFromSymbol(temps, dev_result, N_UPDATES_TEST*sizeof(float));
//  for (int i = 0; i < 100; i++) printf("temp %d: %f\n", i, temps[i]);
  return 0;

}

Using CUDA 6, on linux, on a Quadro 5000 GPU, 1000 "updates" requires about 2 seconds. The majority of the time is spent in the calls to thrust::remove_if and thrust::merge_by_key. For a worst-case real-time estimate, you would want to try and time the worst-case update, which might be something like receiving a longest-possible update.

Upvotes: 2
