Reputation: 573
I am exploring the parallel_scan component of Intel Threading Building Blocks, which is intended for associative operations, and I am finding that parallel_scan takes about 10 times longer than the same computation done serially.
The code I have written to check this is:
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_scan.h"
#include "tbb/tick_count.h"
using namespace std;
using namespace tbb;
template <class T>
class Body
{
    T reduced_result;
    T* const y;
    const T* const x;
public:
    Body( T y_[], const T x_[] ) : reduced_result(0), x(x_), y(y_) {}
    T get_reduced_result() const {return reduced_result;}
    template<typename Tag>
    void operator()( const blocked_range<int>& r, Tag )
    {
        T temp = reduced_result;
        for( int i=r.begin(); i<r.end(); ++i )
        {
            temp = temp+x[i];
            if( Tag::is_final_scan() )
                y[i] = temp;
        }
        reduced_result = temp;
    }
    Body( Body& b, split ) : x(b.x), y(b.y), reduced_result(10) {}
    void reverse_join( Body& a )
    {
        reduced_result = a.reduced_result + reduced_result;
    }
    void assign( Body& b )
    {
        reduced_result = b.reduced_result;
    }
};
template<class T>
float DoParallelScan( T y[], const T x[], int n)
{
    Body<int> body(y,x);
    tick_count t1,t2,t3,t4;
    t1=tick_count::now();
    parallel_scan( blocked_range<int>(0,n), body , auto_partitioner() );
    t2=tick_count::now();
    cout<<"Time Taken for parallel scan is \t"<<(t2-t1).seconds()<<endl;
    return body.get_reduced_result();
}
template<class T1>
float SerialScan(T1 y[], const T1 x[], int n)
{
    tick_count t3,t4;
    t3=tick_count::now();
    T1 temp = 10;
    for( int i=1; i<n; ++i )
    {
        temp = temp+x[i];
        y[i] = temp;
    }
    t4=tick_count::now();
    cout<<"Time Taken for serial scan is \t"<<(t4-t3).seconds()<<endl;
    return temp;
}
int main()
{
    task_scheduler_init init1;
    int y1[100000],x1[100000];
    for(int i=0;i<100000;i++)
        x1[i]=i;
    cout<<fixed;
    cout<<"\n serial scan output is \t"<<SerialScan(y1,x1,100000)<<endl;
    cout<<"\n parallel scan output is \t"<<DoParallelScan(y1,x1,100000)<<endl;
    return 0;
}
Please help me find where I am going wrong.
Upvotes: 3
Views: 1340
Reputation: 4059
I'm the original author of tbb::parallel_scan.
Getting speedup out of parallel scan on multicore systems using "big cores" can be hard. The reason is that parallel scan is inherently a two-pass algorithm. If the data does not fit in outer-level cache, parallel scan has to stream in data from memory twice, whereas the serial algorithm only has to do it once. For an operation as simple as integer addition, the memory traffic, not the ALU, is often the bottleneck for a "big core" that devotes a lot of hardware resources to fast serial execution. If the data does fit in outer-level cache, there might not be enough work to amortize parallel overheads.
I was able to get some parallel speedup (about 2x) for your example with the following changes and conditions:
I hoisted the read of r.end() into a local variable before the loop, like this:
int rend = r.end();
for( int i=r.begin(); i<rend; ++i )
Doing so helps the compiler generate better code because it knows that rend is loop-invariant. Without the hoisting, the compiler has to assume that writes to y[i] might overwrite the field of r that r.end() reads. It might help to similarly hoist the reads of fields x and y into local variables, though the compiler should be able to tell from type-based alias disambiguation that the writes to y[i] do not affect those fields.
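Putting those hoists together, a minimal sketch of the body's operator() could look like this (the local names rend, xl, and yl are my own):
    template<typename Tag>
    void operator()( const blocked_range<int>& r, Tag )
    {
        T temp = reduced_result;
        const int rend = r.end();      // hoisted loop bound: read once, clearly loop-invariant
        const T* const xl = x;         // hoisted field reads, so writes to y[i] cannot be
        T* const yl = y;               // suspected of changing them
        for( int i=r.begin(); i<rend; ++i )
        {
            temp = temp + xl[i];
            if( Tag::is_final_scan() )
                yl[i] = temp;
        }
        reduced_result = temp;
    }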
I increased the input arrays to have 10,000,000 elements, so there is more work to do, and thus better amortize parallel scheduling overheads. To avoid stack overflow, I allocated the arrays in the heap.
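One way to make that change (a sketch only; std::vector is my choice for the heap allocation, and n is my own name for the size constant):
    // at the top of the file:
    #include <vector>
    // in main():
    const int n = 10000000;              // larger problem: overheads are better amortized
    std::vector<int> x1(n), y1(n);       // heap-allocated, so no stack overflow
    for( int i=0; i<n; ++i )
        x1[i] = i;
    // the existing functions take raw arrays, so pass the vectors' storage
    cout<<"\n serial scan output is \t"<<SerialScan(&y1[0], &x1[0], n)<<endl;
    cout<<"\n parallel scan output is \t"<<DoParallelScan(&y1[0], &x1[0], n)<<endl;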
I warmed up the TBB run-time. In general, when doing this sort of timing exercise, it is good to do a "throw away" run first so that one-time startup costs are not counted. To do the warm up (for both serial and parallel code), I wrapped a three-trip loop around the timing logic, like this:
for( int k=0; k<3; ++k ) {
    cout<<"\n serial scan output is \t"<<SerialScan(y1,x1,n)<<endl;
    cout<<"\n parallel scan output is \t"<<DoParallelScan(y1,x1,n)<<endl;
}
This is something I do with most timing experiments, so I can see if first-time costs are significant or whether there is other variation of interest.
I compiled with "gcc -O2 -ltbb".
I ran on a 16-core system with two "Sandy Bridge" chips.
A way to see the impact of memory bandwidth is to change T in your example to a smaller type. When I edited the example to change T from int to char (thus reducing memory bandwidth demands by about 4x), the parallel speedup increased. (Aside: there is a "Body<int>" in the example that should be "Body<T>".)
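A sketch of those two edits, building on the heap-allocated vectors above (char is the smaller type here; the wrap-around of a char sum is harmless for a bandwidth measurement):
    // in DoParallelScan: use the function's template parameter, not int
    Body<T> body(y,x);                   // was: Body<int> body(y,x);
    // in main(): char elements demand roughly 4x less memory bandwidth than int
    std::vector<char> x1(n), y1(n);
    for( int i=0; i<n; ++i )
        x1[i] = static_cast<char>(i);    // values wrap around; fine for a timing test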
Another way to see the impact of memory bandwidth is to try the example on a system with many small cores. I tried the example, modified as previously described for type int, on an Intel(R) Xeon Phi(TM), which has high memory bandwidth and many small cores. I got 4x-7x parallel speedup. Cranking the problem size up to 100,000,000 got me 10x-20x speedup.
To summarize: multi-threaded scan can pay off only when the benefits of parallel computation can outweigh the overhead of making two passes over the data.
Upvotes: 16
Reputation: 5336
How long does it take in total? Maybe your input is just too small, so context switching dominates the runtime of the parallel version. How does it behave if you increase the problem size? How does it behave if you do something more computationally intensive than the simple sum you are doing now?
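As a purely hypothetical illustration (not part of the original code), the per-element work in the scan body could be made heavier while remaining an associative sum, for example:
    // requires #include <cmath>
    // heavier per-element work shifts the bottleneck from memory bandwidth toward the ALU
    for( int i=r.begin(); i<r.end(); ++i )
    {
        temp = temp + static_cast<T>( std::sqrt( (double)x[i] ) );
        if( Tag::is_final_scan() )
            y[i] = temp;
    }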
Upvotes: 0