Reputation: 13
I am working on a problem where the work per thread may vary drastically: one thread may handle 1,000,000 elements while another handles only 1 or 2. I stumbled upon this question, where the answer solves the unbalanced workload by using OpenMP tasks on the CPU. So my question is: can I achieve the same on CUDA?
In case you want more context: the problem I'm trying to solve is, I have n tuples, each with a starting point, an ending point, and a value.
(0, 3, 1), (3, 6, 2), (6, 10, 3), ...
So for each tuple I want to write its value to every position between the starting point and the ending point of another (initially empty) array.
1, 1, 1, 2, 2, 2, 3, 3, 3, 3, ...
It is guaranteed that the start/end ranges do not overlap.
My current approach is one thread per tuple, but since the distance between starting and ending point can vary a lot, the imbalanced workload between threads might become a bottleneck for the program. That may be rare, but it could very well happen.
Upvotes: 0
Views: 302
Reputation: 151799
The most common thread strategy I can think of in CUDA is to assign one thread per output point, and then have each thread do the work necessary to populate its output point.
For your stated objective (have each thread do roughly equal work) this is a useful strategy.
I will suggest using thrust for this. The basic idea is to:
1. Mark the segment boundaries (the tuple end points) in an index array.
2. For each output point, use a vectorized binary search (thrust::upper_bound) against that index array to identify which segment the output point falls in.
3. Gather the corresponding tuple's value into the output using a permutation iterator.
I have used your data; the only change is that I changed the inserted values from 1,2,3 to 5,2,7:
$ cat t1871.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
typedef thrust::tuple<int,int,int> mt;
// returns selected item from tuple
struct my_cpy_functor1
{
  __host__ __device__ int operator()(mt d){ return thrust::get<1>(d); }
};
struct my_cpy_functor2
{
  __host__ __device__ int operator()(mt d){ return thrust::get<2>(d); }
};
int main(){

  mt my_data[] = {{0, 3, 5}, {3, 6, 2}, {6, 10, 7}};
  int ds = sizeof(my_data)/sizeof(my_data[0]); // determine data size
  int os = thrust::get<1>(my_data[ds-1]) - thrust::get<0>(my_data[0]); // and output size
  thrust::device_vector<mt> d_data(my_data, my_data+ds); // transfer data to device
  thrust::device_vector<int> d_idx(ds+1); // create index array for searching of insertion points
  thrust::transform(d_data.begin(), d_data.end(), d_idx.begin()+1, my_cpy_functor1()); // set index array
  thrust::device_vector<int> d_ins(os); // create array to hold insertion points
  thrust::upper_bound(d_idx.begin(), d_idx.end(), thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(os), d_ins.begin()); // identify insertion points
  thrust::transform(thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.begin(), _1 - 1)),
                    thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.end(), _1 - 1)),
                    d_ins.begin(), my_cpy_functor2()); // insert
  thrust::copy(d_ins.begin(), d_ins.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -o t1871 t1871.cu -std=c++14
$ ./t1871
5,5,5,2,2,2,7,7,7,7,
$
Upvotes: 1