Reputation: 13
I am working on a problem where the work per thread may vary drastically: one thread may handle 1,000,000 elements while another handles only 1 or 2. I stumbled upon this question, where the answer solves the unbalanced workload by using OpenMP tasks on the CPU. So my question is: can I achieve the same on CUDA?
In case you want more context: the problem I'm trying to solve is, I have n tuples, each with a starting point, an ending point, and a value.
(0, 3, 1), (3, 6, 2), (6, 10, 3), ...
So for each tuple I want to write its value to every position between the starting point and the ending point of another (initially empty) array.
1, 1, 1, 2, 2, 2, 3, 3, 3, 3, ...
It is guaranteed that the start/end ranges do not overlap.
My current approach is one thread per tuple, but since the distance between starting and ending point can vary a lot, the imbalanced workload between threads might become a bottleneck for the program. That may be rare, but it could very well happen.
Upvotes: 0
Views: 302
Reputation: 151799
The most common thread strategy I can think of in CUDA is to assign one thread per output point, and then have each thread do the work necessary to populate its output point.
For your stated objective (have each thread do roughly equal work) this is a useful strategy.
I will suggest using thrust for this. The basic idea is to:
1. Mark the segment boundaries (the tuple end points) in an index array.
2. For each output point, use a vectorized binary search (thrust::upper_bound) against that index array to identify which segment the output point falls in.
3. Gather the corresponding tuple's value into the output using a permutation iterator.
I have used your data; the only change is that I changed the inserted values from 1,2,3 to 5,2,7:
$ cat t1871.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
typedef thrust::tuple<int,int,int> mt;
// returns selected item from tuple
struct my_cpy_functor1
{
  __host__ __device__ int operator()(mt d){ return thrust::get<1>(d); }
};
struct my_cpy_functor2
{
  __host__ __device__ int operator()(mt d){ return thrust::get<2>(d); }
};
int main(){

  mt my_data[] = {{0, 3, 5}, {3, 6, 2}, {6, 10, 7}};
  int ds = sizeof(my_data)/sizeof(my_data[0]); // determine data size
  int os = thrust::get<1>(my_data[ds-1]) - thrust::get<0>(my_data[0]); // and output size
  thrust::device_vector<mt> d_data(my_data, my_data+ds); // transfer data to device
  thrust::device_vector<int> d_idx(ds+1); // create index array for searching of insertion points
  thrust::transform(d_data.begin(), d_data.end(), d_idx.begin()+1, my_cpy_functor1()); // set index array
  thrust::device_vector<int> d_ins(os); // create array to hold insertion points
  thrust::upper_bound(d_idx.begin(), d_idx.end(), thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(os), d_ins.begin()); // identify insertion points
  thrust::transform(thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.begin(), _1 - 1)),
                    thrust::make_permutation_iterator(d_data.begin(), thrust::make_transform_iterator(d_ins.end(), _1 - 1)),
                    d_ins.begin(), my_cpy_functor2()); // insert
  thrust::copy(d_ins.begin(), d_ins.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -o t1871 t1871.cu -std=c++14
$ ./t1871
5,5,5,2,2,2,7,7,7,7,
$
Upvotes: 1