Reputation: 855
I am a very new programmer, and I have some trouble with the examples from Intel. I think it would be helpful if I could see how the most basic possible loop is implemented in TBB.
for (n=0 ; n < songinfo.frames; ++n) {
sli[n]=songin[n*2];
sri[n]=songin[n*2+1];
}
Here is a loop I am using to de-interleave audio data. Would this loop benefit from TBB? How would you implement it?
Upvotes: 3
Views: 2398
Reputation: 20211
First of all, for the following code I assume your arrays are of type mytype*; otherwise the code needs some modifications. Furthermore, I assume that your ranges don't overlap; otherwise parallelization attempts won't work correctly (at least not without more work).
Since you asked for it in tbb:
First you need to initialize the library somewhere (typically in your main). For the code below, assume I put a using namespace tbb somewhere.
int main(int argc, char *argv[]){
task_scheduler_init init;
...
}
Then you will need a functor which captures your arrays and executes the body of the for loop:
struct apply_func {
const mytype* songin; //whatever type you are operating on
mytype* sli;
mytype* sri;
apply_func(const mytype* sin, mytype* sl, mytype* sr):songin(sin), sli(sl), sri(sr)
{}
void operator()(const blocked_range<size_t>& range) const { //must be const: parallel_for calls the body through a const reference
for(size_t n = range.begin(); n !=range.end(); ++n){
sli[n]=songin[n*2];
sri[n]=songin[n*2+1];
}
}
};
Now you can use parallel_for to parallelize this loop:
size_t grainsize = 1000; //or whatever you decide on (testing required for best performance);
apply_func func(songin, sli, sri);
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize), func);
That should do it (if I remember correctly; I haven't looked at TBB in a while, so there might be small mistakes).
If you use C++11, you can simplify the code by using a lambda:
size_t grainsize = 1000; //or whatever you decide on (testing required for best performance);
parallel_for(blocked_range<size_t>(0, songinfo.frames, grainsize),
[&](const blocked_range<size_t>& range){
for(size_t n = range.begin(); n !=range.end(); ++n){
sli[n]=songin[n*2];
sri[n]=songin[n*2+1];
}
});
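To get a feel for what grainsize does in the calls above, here is a minimal sketch in plain C++ with no TBB calls at all. It assumes the range is halved repeatedly until a piece is no larger than grainsize, which is roughly how blocked_range is subdivided as far as I understand it; the real chunk sizes also depend on the partitioner TBB uses. The names split_range and chunks_for are hypothetical helpers made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Chunk = std::pair<std::size_t, std::size_t>;  // half-open interval [begin, end)

// Hypothetical helper (not part of TBB): halves [begin, end) until every
// piece holds at most grainsize elements and records the resulting pieces.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<Chunk>& chunks) {
    assert(grainsize > 0);
    if (end - begin <= grainsize) {
        // Small enough: this piece would be handed to the loop body as one unit.
        if (begin != end) chunks.emplace_back(begin, end);
        return;
    }
    const std::size_t middle = begin + (end - begin) / 2;
    split_range(begin, middle, grainsize, chunks);
    split_range(middle, end, grainsize, chunks);
}

// Hypothetical convenience wrapper: the chunks for the range [0, frames).
std::vector<Chunk> chunks_for(std::size_t frames, std::size_t grainsize) {
    std::vector<Chunk> chunks;
    split_range(0, frames, grainsize, chunks);
    return chunks;
}
```

Under this assumption a chunk can end up with as few as about half of grainsize elements, so the grainsize is best read as an upper bound on the work per chunk rather than an exact size.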
That being said, TBB is not exactly what I would recommend for a new programmer. I would really suggest parallelizing only code which is trivial to parallelize until you have a very firm grip on threading. For this I would suggest using OpenMP, which is quite a bit simpler to start with than TBB, while still being powerful enough to parallelize a lot of stuff (it depends on the compiler supporting it, though). For your loop it would look like the following:
#pragma omp parallel for
for(size_t n = 0; n < songinfo.frames; ++n) {
sli[n]=songin[n*2];
sri[n]=songin[n*2+1];
}
Then you have to tell your compiler to compile and link with OpenMP (-fopenmp for gcc, /openmp for Visual C++). As you can see, it is quite a bit simpler to use than TBB (for such easy use cases; more complex scenarios are a different matter), and it has the added benefit of still working on platforms which don't support OpenMP or TBB (since unknown #pragmas are ignored by the compiler, the loop simply runs serially there). Personally, I'm using OpenMP in favor of TBB for some projects, since I couldn't use TBB's open source license and buying TBB was a bit too steep for those projects.
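For reference, here is a self-contained sketch of the OpenMP variant wrapped in a function. The function name deinterleave, the use of std::vector, and float as the sample type are all assumptions made for this example; the question does not say what the real types are. The loop index is signed because OpenMP versions before 3.0 expect a signed loop variable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: splits interleaved stereo samples (L R L R ...) into
// a left and a right channel. The sample type float is an assumption.
void deinterleave(const std::vector<float>& songin,
                  std::vector<float>& sli,
                  std::vector<float>& sri) {
    assert(songin.size() % 2 == 0);  // one left and one right sample per frame
    const long frames = static_cast<long>(songin.size() / 2);
    sli.resize(static_cast<std::size_t>(frames));
    sri.resize(static_cast<std::size_t>(frames));

    // Each iteration writes to its own elements, so the iterations are independent.
    // Without OpenMP support the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long n = 0; n < frames; ++n) {
        sli[n] = songin[n * 2];
        sri[n] = songin[n * 2 + 1];
    }
}
```

Built with -fopenmp (gcc) the loop is spread across threads; built without it the same source still compiles and produces the same result.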
Now that we have the question of how to parallelize the loop out of the way, let's get to the question of whether it's worth it. This is a question which really can't be answered easily, since it completely depends on how many elements you process and what kind of platform your program is expected to run on. Your problem is very bandwidth heavy, so I wouldn't count on too much of an increase in performance.
With only 1000 elements, the parallel version of the loop is very likely to be slower than the single-threaded version due to overhead. [...] 1.X even if you use a lot of processors). [...] __restrict (for gcc, no clue for vs) might help with that problem.
Personally, I think the situation where you are most likely to see a significant performance increase is if your system has a single multi-core CPU for which the dataset fits into the L3 cache (but not the individual L2 caches). For bigger datasets your performance will probably increase, but not by much (and correctly using prefetching might get similar gains). Of course, this is pure speculation.
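Since the answer depends so much on the element count and the machine, measuring is the only reliable way to find out. The following is a rough sketch of how the serial loop could be timed with std::chrono; the helper name time_serial_deinterleave, the synthetic input data, and float as the sample type are assumptions made for illustration. A single run like this is noisy, so treat the numbers as a first impression only.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: de-interleaves `frames` stereo frames of synthetic data
// with the serial loop and returns the elapsed time in microseconds.
double time_serial_deinterleave(std::size_t frames,
                                std::vector<float>& sli,
                                std::vector<float>& sri) {
    assert(frames > 0);
    std::vector<float> songin(frames * 2);
    for (std::size_t i = 0; i < songin.size(); ++i) {
        songin[i] = static_cast<float>(i);  // synthetic samples: 0, 1, 2, ...
    }
    sli.assign(frames, 0.0f);
    sri.assign(frames, 0.0f);

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < frames; ++n) {
        sli[n] = songin[n * 2];
        sri[n] = songin[n * 2 + 1];
    }
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Calling it with, say, 1000 frames and then a few million frames, and comparing against a parallel version timed the same way, should show fairly quickly at which sizes the threading overhead pays for itself on a given machine.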
Upvotes: 6