MVTC

Reputation: 855

Very Basic for loop using TBB

I am a very new programmer, and I am having some trouble with the examples from Intel. I think it would be helpful if I could see how the most basic possible loop is implemented in tbb.

for (n = 0; n < songinfo.frames; ++n) {
    sli[n] = songin[n*2];
    sri[n] = songin[n*2+1];
}

Here is a loop I am using to de-interleave audio data. Would this loop benefit from tbb? How would you implement it?

Upvotes: 3

Views: 2398

Answers (1)

Grizzly

Reputation: 20211

First of all, for the following code I assume your arrays are of type mytype*, otherwise the code needs some modifications. Furthermore I assume that your ranges don't overlap, otherwise parallelization attempts won't work correctly (at least not without more work).

Since you asked for it in tbb:

First you need to initialize the library somewhere (typically in your main). For the code assume I put a using namespace tbb somewhere.

int main(int argc, char *argv[]){
   task_scheduler_init init;
   ...
}

Then you will need a functor which captures your arrays and executes the body of the for loop:

struct apply_func {
    const mytype* songin; //whatever type you are operating on
    mytype* sli;
    mytype* sri;
    apply_func(const mytype* sin, mytype* sl, mytype* sr):songin(sin), sli(sl), sri(sr)
    {}
    void operator()(const blocked_range<size_t>& range) const { //parallel_for needs a const call operator
      for(size_t n = range.begin(); n != range.end(); ++n){
        sli[n]=songin[n*2];
        sri[n]=songin[n*2+1];
      }
    }
};

Now you can use parallel_for to parallelize this loop:

size_t grainsize = 1000; //or whatever you decide on (testing required for best performance);
apply_func func(songin, sli, sri);
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize), func);

That should do it (if I remember correctly; I haven't looked at tbb in a while, so there might be small mistakes). If you use C++11, you can simplify the code by using a lambda:

size_t grainsize = 1000; //or whatever you decide on (testing required for best performance);
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize), 
             [&](const blocked_range<size_t>& range){
                for(size_t n = range.begin(); n != range.end(); ++n){
                  sli[n]=songin[n*2];
                  sri[n]=songin[n*2+1];
                }
             });

That being said, tbb is not exactly what I would recommend for a new programmer. I would really suggest parallelizing only code which is trivial to parallelize until you have a very firm grip on threading. For this I would suggest using openmp, which is quite a bit simpler to start with than tbb, while still being powerful enough to parallelize a lot of stuff (it depends on the compiler supporting it, though). For your loop it would look like the following:

#pragma omp parallel for
for(size_t n = 0; n < songinfo.frames; ++n) {
  sli[n]=songin[n*2];
  sri[n]=songin[n*2+1];
}

Then you have to tell your compiler to compile and link with openmp (-fopenmp for gcc, /openmp for visual c++). As you can see it is quite a bit simpler to use than tbb (for such easy usecases; more complex scenarios are a different matter) and has the added benefit of still compiling on platforms whose compilers don't support openmp (since unknown #pragmas are ignored by the compiler). Personally I'm using openmp instead of tbb for some projects, since I couldn't use tbb's open source license and buying it was a bit too steep for the projects.
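As a concrete sketch of those build commands (the source file name is made up, and exact flags can vary by toolchain version):

```shell
# GCC/Clang: enable OpenMP at both compile and link time
g++ -O2 -fopenmp deinterleave.cpp -o deinterleave

# Visual C++ (from a developer command prompt)
cl /O2 /openmp deinterleave.cpp
```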

Now that we have the question of how to parallelize the loop out of the way, let's get to the question of whether it's worth it. This is a question which really can't be answered easily, since it completely depends on how many elements you process and what kind of platform your program is expected to run on. Your problem is very bandwidth heavy, so I wouldn't count on too much of an increase in performance.

  • If you are only processing 1000 elements, the parallel version of the loop is very likely to be slower than the single threaded version due to overhead.
  • If your data is not in the cache (because it doesn't fit) and your system is very bandwidth starved, you might not see much of a benefit (although it's likely that you will see some benefit, just don't be surprised if it's on the order of 1.X even if you use a lot of processors).
  • If your system is ccNUMA (likely for multi-socket systems), your performance might decrease regardless of the amount of elements, due to additional transfer costs.
  • The compiler might miss optimizations regarding pointer aliasing (since the loop body is moved to a different function). Using __restrict (for gcc, no clue for vs) might help with that problem.
  • ...

Personally I think the situation where you are most likely to see a significant performance increase is if your system has a single multi-core cpu for which the dataset fits into the L3 cache (but not the individual L2 caches). For bigger datasets your performance will probably increase, but not by much (and correctly using prefetching might get similar gains). Of course this is pure speculation.

Upvotes: 6
