PSchn

Reputation: 728

tbb::parallel_reduce vs tbb::combinable vs tbb::enumerable_thread_specific

I want to go through an image and process specific values, without regard to the order of the elements. The image consists of an unsigned char* array containing a mask (255 if the pixel should be processed, else 0) and an unsigned short* array with the pixel values.

I implemented three different methods with TBB, using a single for loop over the mask array and computing the x,y coordinates from the loop variable: x = i%width; y = i/width;. If the pixel is visible, I want to transform the point using Eigen. vector4d is a std::vector<std::array<double,4>> used to store the points.
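
The snippets below use a few aliases and globals whose definitions are not shown explicitly; reconstructed from the description above, they would look like this:

#include <array>
#include <vector>
#include <tbb/combinable.h>
#include <tbb/enumerable_thread_specific.h>

// Reconstructed from the description above -- the original post does not
// show these definitions:
typedef std::array<double, 4> array4d;              // one homogeneous point
typedef std::vector<array4d> vector4d;              // container of points

typedef tbb::enumerable_thread_specific<vector4d> enumerableType;
tbb::combinable<vector4d> MyCombinableType;         // used in version 1
enumerableType MyEnumerableType;                    // used in version 2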

Here are my three implementations with TBB:

1. tbb::combinable and tbb::parallel_for:

void Combinable(int width, int height, unsigned char* mask, unsigned short* pixel){ 
    MyCombinableType.clear();
    MyCombinableType.local().reserve(width*height);
    tbb::parallel_for( tbb::blocked_range<int>(0, width*height),
        [&](const tbb::blocked_range<int> &r) 
    {       
        vector4d& local = MyCombinableType.local(); 
        const int end = r.end(); 
        for (int i = r.begin(); i != end; ++i)
        {
            if(mask[i]!=0)
            {                                       
                array4d arr = {(double)(i%width), (double)(i/width), (double)pixel[i], 1}; 
                //Map with Eigen and transform
                local.push_back(arr);           
            }
        }
    });

    vector4d idx = MyCombinableType.combine(
        []( vector4d x, vector4d y) 
    {               
        std::size_t n = x.size();
        x.resize(n + y.size());
        std::move(y.begin(), y.end(), x.begin() + n);
        return x;
    });
}

2. tbb::enumerable_thread_specific and tbb::parallel_for:

void Enumerable(int width, int height, unsigned char* mask, unsigned short* pixel){
    MyEnumerableType.clear();
    MyEnumerableType.local().reserve(width*height);
    tbb::parallel_for( tbb::blocked_range<int>(0, width*height),
        [&](const tbb::blocked_range<int> &r) 
    {
        enumerableType::reference local = MyEnumerableType.local();
        for (int i = r.begin(); i != r.end(); ++i)
        {
            if(mask[i]!=0)
            {
                array4d arr = {(double)(i%width), (double)(i/width), (double)pixel[i], 1}; 
                //Map with Eigen and transform
                local.push_back(arr);               

            }
        }
    });

    vector4d idx = MyEnumerableType.combine(
        [](vector4d x, vector4d y) 
    {           
        std::size_t n = x.size();
        x.resize(n + y.size());
        std::move(y.begin(), y.end(), x.begin() + n);
        return x;
    });
}

3. tbb::parallel_reduce:

void Reduce(int width, int height, unsigned char* mask, unsigned short* pixel){
    vector4d idx = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, width*height ),vector4d(),
            [&](const tbb::blocked_range<int>& r, vector4d init)->vector4d 
        {
            const int end = r.end(); 
            init.reserve(r.size());
            for( int i=r.begin(); i!=end; ++i )
            {   
                if(mask[i]!=0)
                {               
                    array4d arr = {(double)(i%width), (double)(i/width), (double)pixel[i], 1}; 
                    //Map with Eigen and transform
                    init.push_back(arr);            
                }
            }
            return init;
        },
        []( vector4d x,vector4d y )
        {
            std::size_t n = x.size();
            x.resize(n + y.size());
            std::move(y.begin(), y.end(), x.begin() + n);           
            return x;
        }
    );  
}

I compared the runtime of the three versions against a serial implementation. The arrays had 8,400,000 elements and every algorithm was repeated 100 times. The results are:

I assume that the combine statement is the bottleneck here. What am I doing wrong? Why is parallel_reduce so much slower? Please help!

Upvotes: 1

Views: 1864

Answers (2)

atb

Reputation: 1462

You are using the functional form of parallel_reduce; try the more efficient imperative form instead. Unfortunately it cannot be called with lambdas; you must define a Body class:

https://www.threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_reduce_func.html

It should minimize the number of vector4d copies made during your reduction. The vector4d should be a member of your Body class so that it can be reused and appended to by multiple ranges, rather than constructing and merging a unique vector4d for every subdivided range.

(Note: the splitting constructor should NOT copy the contents of the vector4d member; notice how value is always initialized to 0 in Intel's example above.)
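
A minimal sketch of such a Body class for this problem -- GatherBody is just an illustrative name, and the vector4d/array4d aliases from the question are assumed:

#include <iterator>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

struct GatherBody {
    vector4d points;                  // accumulated results; reused across ranges
    const unsigned char* mask;
    const unsigned short* pixel;
    int width;

    GatherBody(const unsigned char* m, const unsigned short* p, int w)
        : mask(m), pixel(p), width(w) {}

    // Splitting constructor: starts with an EMPTY points vector -- it must
    // not copy the contents accumulated by the other body so far.
    GatherBody(GatherBody& other, tbb::split)
        : mask(other.mask), pixel(other.pixel), width(other.width) {}

    // Process one subrange, appending to this body's own vector.
    void operator()(const tbb::blocked_range<int>& r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            if (mask[i] != 0) {
                array4d arr = {(double)(i % width), (double)(i / width),
                               (double)pixel[i], 1};
                //Map with Eigen and transform
                points.push_back(arr);
            }
        }
    }

    // Merge a split-off body's results back into this one.
    void join(GatherBody& rhs) {
        points.insert(points.end(),
                      std::make_move_iterator(rhs.points.begin()),
                      std::make_move_iterator(rhs.points.end()));
    }
};

void ReduceImperative(int width, int height, unsigned char* mask, unsigned short* pixel) {
    GatherBody body(mask, pixel, width);
    tbb::parallel_reduce(tbb::blocked_range<int>(0, width * height), body);
    vector4d idx = std::move(body.points);
}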

Upvotes: 0

Anton

Reputation: 6557

There are a few optimizations you can apply here.

  1. Avoid excessive copying: pass const vector4d& instead, and use [&] lambdas everywhere.
  2. Use a temporary vector4d on the stack instead of resizing one of the arguments, and use it for the return statement.
  3. Generally, use blocked_range2d instead of calculating x = i%width; y = i/width. This not only optimizes away the excessive computations but, much more importantly, it improves the cache access pattern, which might improve cache usage (not in this case, though); see the sketch below this list.
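
As a rough illustration of point 3, here is how the first (combinable) version could use blocked_range2d -- a sketch assuming the same globals as in the question:

#include <tbb/blocked_range2d.h>
#include <tbb/parallel_for.h>

// blocked_range2d hands each task a rectangular tile, so x and y come
// straight from the sub-ranges and the per-pixel div/mod disappears.
tbb::parallel_for(tbb::blocked_range2d<int>(0, height, 0, width),
    [&](const tbb::blocked_range2d<int>& r)
{
    vector4d& local = MyCombinableType.local();
    for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
        for (int x = r.cols().begin(); x != r.cols().end(); ++x) {
            const int i = y * width + x;   // flat index into mask/pixel
            if (mask[i] != 0) {
                array4d arr = {(double)x, (double)y, (double)pixel[i], 1};
                //Map with Eigen and transform
                local.push_back(arr);
            }
        }
    }
});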

Upvotes: 1
