Steffen Binas

Reputation: 1488

TBB::parallel_for creates too many class/body copies?

I followed the basic parallel_for example of TBB. The documentation states:

Template function parallel_for requires that the body object have a copy constructor, which is invoked to create a separate copy (or copies) for each worker thread.

My algorithm needs some memory per concurrent worker to operate, so I now allocate that memory in the copy constructor. It works, but these are the numbers on my 8-thread machine: on a range of 0-10000 I get about 2000 work chunks (calls of operator()), and the copy constructor is called about 300 times! That's the problem: 300 memory allocations where only 8 are needed. I checked that only 8 threads are running and that no more than 8 class copies are ever used concurrently.

Am I completely wrong to assume that the number of copies correlates with the number of threads? Is there a better way to allocate the memory?

#include "tbb/tbb.h"

using namespace tbb;

class ApplyFoo {
    float *const my_a;
public:
    void operator()( const blocked_range<size_t>& r ) const {
        float *a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i ) 
           Foo(a[i]); // Foo uses the allocated memory
    }
    ApplyFoo( float a[] ) :
        my_a(a)
    {}

    // The copy constructor is called far more often than the number of threads
    ApplyFoo( const ApplyFoo& other ) :
        my_a(other.my_a)
    {
      // Allocate some memory here...
    }

    ~ApplyFoo() 
    {
      // Free the memory here...
    }
};

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}

Upvotes: 1

Views: 959

Answers (1)

Anton

Reputation: 6587

Am I completely wrong assuming that the number of copies correlates with the number of threads?

You are right to assume such a correlation for the default partitioner (auto_partitioner), but the multiplier is rather large and depends on run-time conditions, so the number of copies can grow as big as the number of subranges. Thus there is no surprise here.

However, the number of subranges can be controlled by specifying the grain size:

size_t p = task_scheduler_init::default_num_threads();
size_t grainsize = 2*n/p-1;
parallel_for(blocked_range<size_t>(0,n,grainsize), ApplyFoo(a));

The formula 2*n/p-1 is used because in TBB the grain size is not the minimal size of a possible subrange but the threshold used to decide whether a range should be split further. For example, with n = 10000 and p = 8 the grain size is 2499, so a range can be halved at most three times, giving at most 8 chunks of 1250 elements each.

Also, for completely predictable partitioner behavior with respect to the number of parallel_for body copies (independent of run-time conditions), use simple_partitioner instead:

parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a), simple_partitioner());

Though this can add overhead for big ranges and small grain sizes, since it does not aggregate subranges.

Is there a better way to allocate the memory?

Yes, and the grain size is not a good way to do this, since it prevents the TBB scheduler from load-balancing effectively. I recommend using thread-local storage containers such as tbb::enumerable_thread_specific instead. Unlike compiler-based TLS, such a container can be traversed, so the memory can be cleaned up in one place, even after the thread that created a value is gone.
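
For example, here is a minimal sketch of how the question's code could use tbb::enumerable_thread_specific for the per-worker scratch memory. The names ApplyFooTLS/ParallelApplyFooTLS, the std::vector<float> buffer, and its size are only illustrative placeholders for whatever memory Foo actually needs:

#include "tbb/tbb.h"
#include <vector>

using namespace tbb;

void Foo( float value ); // from the question; assumed to use the scratch memory somehow

// Per-thread scratch buffers: an element is created lazily the first time a
// thread calls .local(), so there are at most as many allocations as threads.
typedef enumerable_thread_specific< std::vector<float> > ScratchBuffers;

class ApplyFooTLS {
    float *const my_a;
    ScratchBuffers& my_scratch;
public:
    ApplyFooTLS( float a[], ScratchBuffers& scratch ) :
        my_a(a), my_scratch(scratch)
    {}

    // The implicit copy constructor is fine now: copying the body copies only a
    // pointer and a reference, so no memory allocation happens per copy.

    void operator()( const blocked_range<size_t>& r ) const {
        std::vector<float>& scratch = my_scratch.local(); // one buffer per thread
        if( scratch.empty() )
            scratch.resize(1024);                         // example size only
        float *a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i )
            Foo(a[i]); // hand &scratch[0] to Foo in whatever way it needs the memory
    }
};

void ParallelApplyFooTLS( float a[], size_t n ) {
    ScratchBuffers scratch;
    parallel_for(blocked_range<size_t>(0,n), ApplyFooTLS(a, scratch));
    // scratch can be traversed here (scratch.begin()..scratch.end()) to post-process
    // or release the per-thread memory in one place; the vectors are freed
    // automatically when scratch goes out of scope.
}

With this layout the container allocates at most one buffer per thread that actually runs the body, copying the body stays cheap, and cleanup is concentrated in one place regardless of how the range was partitioned.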

Upvotes: 1
