Reputation: 728
I want to go through an image and process some specific values with regard to the order of the elements. The image consists of one unsigned char* array containing a mask (255 if the pixel should be processed, else 0) and an unsigned short* array with the pixel values.
I implemented three different methods with TBB, using a single for-loop over the mask array and computing the x,y coordinates from the loop variable: x = i%width; y = i/width;. If the pixel is visible, I want to transform the point using Eigen.
The vector4d is a std::vector<std::array<double,4>> used to store the points.
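For reference, the snippets below assume roughly the following type aliases and thread-local containers (the exact declarations are not shown here, so this is only a reconstruction inferred from the code):

#include <array>
#include <vector>
#include <tbb/tbb.h>

using array4d  = std::array<double, 4>;
using vector4d = std::vector<array4d>;   // collected points, one (x, y, value, 1) entry each

// thread-local accumulators referenced by the first two versions
static tbb::combinable<vector4d> MyCombinableType;
using enumerableType = tbb::enumerable_thread_specific<vector4d>;
static enumerableType MyEnumerableType;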
Here are my three implementations with TBB:
1. tbb::combinable and tbb::parallel_for:
void Combinable(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    MyCombinableType.clear();
    MyCombinableType.local().reserve(width * height);

    tbb::parallel_for(tbb::blocked_range<int>(0, width * height),
        [&](const tbb::blocked_range<int>& r)
        {
            vector4d& local = MyCombinableType.local();
            const int end = r.end();
            for (int i = r.begin(); i != end; ++i)
            {
                if (mask[i] != 0)
                {
                    array4d arr = { double(i % width), double(i / width), double(pixel[i]), 1.0 };
                    // Map with Eigen and transform
                    local.push_back(arr);
                }
            }
        });

    vector4d idx = MyCombinableType.combine(
        [](vector4d x, vector4d y)
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);
            return x;
        });
}
2. tbb::enumerable_thread_specific and tbb::parallel_for:
void Enumerable(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    MyEnumerableType.clear();
    MyEnumerableType.local().reserve(width * height);

    tbb::parallel_for(tbb::blocked_range<int>(0, width * height),
        [&](const tbb::blocked_range<int>& r)
        {
            enumerableType::reference local = MyEnumerableType.local();
            for (int i = r.begin(); i != r.end(); ++i)
            {
                if (mask[i] != 0)
                {
                    array4d arr = { double(i % width), double(i / width), double(pixel[i]), 1.0 };
                    // Map with Eigen and transform
                    local.push_back(arr);
                }
            }
        });

    vector4d idx = MyEnumerableType.combine(
        [](vector4d x, vector4d y)
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);
            return x;
        });
}
3. tbb::parallel_reduce:
void Reduce(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    vector4d idx = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, width * height), vector4d(),
        [&](const tbb::blocked_range<int>& r, vector4d init) -> vector4d
        {
            const int end = r.end();
            init.reserve(r.size());
            for (int i = r.begin(); i != end; ++i)
            {
                if (mask[i] != 0)
                {
                    array4d arr = { double(i % width), double(i / width), double(pixel[i]), 1.0 };
                    // Map with Eigen and transform
                    init.push_back(arr);
                }
            }
            return init;
        },
        [](vector4d x, vector4d y)
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);
            return x;
        });
}
I compared the runtime of the three versions with a serial implementation. The arrays had 8400000 elements and every algorithm was repeated 100 times. The results are:
I assume that the combine statement is the bottleneck here. What am I doing wrong? Why is parallel_reduce so much slower? Please help!
Upvotes: 1
Views: 1864
Reputation: 1462
You are using the functional form of parallel_reduce; try the more efficient imperative form instead. Unfortunately it cannot be called with lambdas, you must define a Body class:
https://www.threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_reduce_func.html
It should minimize the number of vector4d copies that are made during your reduction. The vector4d should be a member of your Body class so that it can be reused and appended to by multiple ranges, rather than constructing and merging a unique vector4d for every subdivided range.
(Note: the splitting constructor should NOT copy the contents of the vector4d member; notice how value is always initialized to 0 in Intel's example above.)
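As a rough sketch of what such a Body class could look like for this problem (the names MaskReduceBody and ReduceImperative are placeholders, and the typedefs from the question are assumed):

struct MaskReduceBody
{
    vector4d points;                 // per-body result, reused across ranges
    int width;
    const unsigned char* mask;
    const unsigned short* pixel;

    MaskReduceBody(int w, const unsigned char* m, const unsigned short* p)
        : width(w), mask(m), pixel(p) {}

    // Splitting constructor: does NOT copy 'points', the new body starts empty
    // (like value being reset to 0 in Intel's example).
    MaskReduceBody(MaskReduceBody& other, tbb::split)
        : width(other.width), mask(other.mask), pixel(other.pixel) {}

    void operator()(const tbb::blocked_range<int>& r)
    {
        for (int i = r.begin(); i != r.end(); ++i)
        {
            if (mask[i] != 0)
            {
                array4d arr = { double(i % width), double(i / width), double(pixel[i]), 1.0 };
                // Map with Eigen and transform
                points.push_back(arr);
            }
        }
    }

    void join(MaskReduceBody& rhs)
    {
        // append the other body's points to this one
        points.insert(points.end(), rhs.points.begin(), rhs.points.end());
    }
};

void ReduceImperative(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    MaskReduceBody body(width, mask, pixel);
    tbb::parallel_reduce(tbb::blocked_range<int>(0, width * height), body);
    vector4d idx = std::move(body.points);   // final result accumulated in the body
}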
Upvotes: 0
Reputation: 6557
There are a few optimizations you can apply here:
1. Use const vector4d& arguments instead of passing by value, and use [&] lambdas everywhere.
2. Create the resulting vector4d on the stack instead of resizing one of the arguments, and use it for the return statement.
3. Use blocked_range2d instead of calculating x = i%width; y = i/width (see the sketch after this answer). This not only optimizes out excessive computations but, much more importantly, it optimizes the cache access pattern, which might improve cache usage (not in this case though).
Upvotes: 1
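A minimal sketch of these points applied to the combinable version, assuming the globals and typedefs from the question; the function name Combinable2d is made up for illustration:

void Combinable2d(int width, int height, unsigned char* mask, unsigned short* pixel)
{
    MyCombinableType.clear();

    // 2D range: y and x come directly from the range, no i%width / i/width,
    // and each block walks a cache-friendly rectangle of the image.
    tbb::parallel_for(tbb::blocked_range2d<int>(0, height, 0, width),
        [&](const tbb::blocked_range2d<int>& r)
        {
            vector4d& local = MyCombinableType.local();
            for (int y = r.rows().begin(); y != r.rows().end(); ++y)
            {
                for (int x = r.cols().begin(); x != r.cols().end(); ++x)
                {
                    const int i = y * width + x;
                    if (mask[i] != 0)
                    {
                        array4d arr = { double(x), double(y), double(pixel[i]), 1.0 };
                        // Map with Eigen and transform
                        local.push_back(arr);
                    }
                }
            }
        });

    // combine as before, but take the arguments by const reference (point 1)
    // and build the result on the stack (point 2)
    vector4d idx = MyCombinableType.combine(
        [](const vector4d& x, const vector4d& y)
        {
            vector4d out;
            out.reserve(x.size() + y.size());
            out.insert(out.end(), x.begin(), x.end());
            out.insert(out.end(), y.begin(), y.end());
            return out;
        });
}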