JohannB

Reputation: 377

TBB Parallel Pipeline: Filter Timing Inconsistent

I'm writing an application that processes a video stream using a `tbb::parallel_pipeline`. My first filter contains two important operations that must occur immediately one after the other.

My tests show that the delay between the two operations is anywhere from 3 to 20 milliseconds when I set `max_number_of_live_tokens` to 6 (the number of filters I have), but is consistently 3 to 4 milliseconds when `max_number_of_live_tokens` is 1. The jitter in the first case is unacceptable for my application, but I need to allow multiple tokens in flight simultaneously to exploit parallelism.

Here is my pipeline setup:

tbb::parallel_pipeline(6, // max_number_of_live_tokens

    // 1st filter
    tbb::make_filter<void, shared_ptr<PipelinePacket_t>>(tbb::filter::serial_in_order,
        [&](tbb::flow_control& fc) -> shared_ptr<PipelinePacket_t> {
            shared_ptr<PipelinePacket_t> pPacket = grabFrame();
            return pPacket;
        })
    &
    ... // 5 other filters that process the image - all 'serial_in_order'
);

And here is my grabFrame() function:

shared_ptr<VisionPipeline::PipelinePacket_t> VisionPipeline::grabFrame() {

    shared_ptr<PipelinePacket_t> pPacket(new PipelinePacket_t);

    m_cap >> pPacket->frame; // Operation A (use the OpenCV API to capture a frame)

    pPacket->motion.gyroDeg = m_imu.getGyroZDeg(); // Operation B (read a gyro value)

    return pPacket;
}

I need operations A and B to happen as close together as possible, so that the gyro reading reflects the sensor's state at the moment the frame was captured.

My guess is that the jitter that occurs when multiple tokens are in flight is caused by tasks from other filters running on the same thread as the first filter and interrupting it while grabFrame() is executing. I've dug through some TBB documentation but can't find anything on how parallel_pipeline breaks filters up into tasks, so it is unclear to me whether TBB is somehow breaking grabFrame() up into multiple TBB tasks or not.

Is this assumption correct? If so, how can I tell TBB not to interrupt the first filter between operations A and B with other tasks?

Upvotes: 3

Views: 516

Answers (1)

Ext3h

Reputation: 6393

OpenCV itself uses TBB internally for various operations. So if this is actually related to TBB, it's not that you were interrupted between A and B, but rather that OpenCV is fighting for priority with the remainder of the filter chain. Unlikely, though.

> so it is unclear to me whether TBB is somehow breaking grabFrame() up into multiple TBB tasks or not

That never happens. Unless grabFrame() contains parts that explicitly dispatch work via TBB, TBB has no effect on it whatsoever; it does not magically split your functions into tasks.

But that may not even be your only issue. If your filters are heavy on memory bandwidth, you are likely slowing down the actual capture process significantly just by running the image processing concurrently.
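One way to tell the two apart is to timestamp the operations inside grabFrame() itself. A minimal instrumentation sketch of your function (the printf output format is just for illustration):

#include <chrono>
#include <cstdio>

shared_ptr<VisionPipeline::PipelinePacket_t> VisionPipeline::grabFrame() {

    shared_ptr<PipelinePacket_t> pPacket(new PipelinePacket_t);

    auto t0 = std::chrono::steady_clock::now();
    m_cap >> pPacket->frame;                        // Operation A
    auto t1 = std::chrono::steady_clock::now();
    pPacket->motion.gyroDeg = m_imu.getGyroZDeg();  // Operation B
    auto t2 = std::chrono::steady_clock::now();

    auto us = [](auto d) {
        return (long long)std::chrono::duration_cast<std::chrono::microseconds>(d).count();
    };
    // If (t1 - t0) jitters while (t2 - t1) stays small, the capture itself is
    // being slowed down (OpenCV / memory bandwidth), not the A->B gap.
    std::printf("capture: %lld us, A->B gap: %lld us\n", us(t1 - t0), us(t2 - t1));

    return pPacket;
}

If the A->B gap stays flat at every token count, TBB interruption is ruled out and the capture path is the place to look.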

Looks like you are running the full image through 5 filters in a row, is that correct? Full resolution, not tiled? If so, most of these filters are likely not ALU-constrained, but constrained by memory bandwidth, as you are not staying within CPU cache bounds either.

If you wish to go parallel, you must get rid of the write-backs to main memory between the filter stages. The only way to do that is either to start tiling the images in the input filter of the filter chain, or to write a custom all-in-one filter kernel. If you have filters in that chain with spatial dependencies, it's obviously not as easy as I make it sound; then you have to include some overlap in the upper stages.
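For illustration, a minimal sketch of the tiling variant. The strip count, the Tile struct and runTiledPipeline are made-up names (not from your code), the gyro read is stubbed out, and the blur stands in for your actual per-strip work:

#include <tbb/pipeline.h>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/videoio.hpp>

struct Tile {
    cv::Mat strip;   // ROI view into the full frame -- no pixel copy
    int     y;       // row offset of the strip in the full frame
    double  gyroDeg; // gyro value sampled once per frame, right after capture
};

void runTiledPipeline(cv::VideoCapture& cap, int nFrames) {
    const int kStrips = 8; // strips per frame; tune so the working set fits in L3
    cv::Mat frame;
    double gyroDeg = 0.0;
    int strip = kStrips;   // forces a capture on the first call

    tbb::parallel_pipeline(kStrips,
        tbb::make_filter<void, Tile>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> Tile {
                if (strip == kStrips) {                 // previous frame fully emitted
                    if (nFrames-- == 0) { fc.stop(); return Tile{}; }
                    frame = cv::Mat();                  // fresh buffer, so strips still
                    cap >> frame;                       // in flight keep their data alive
                    gyroDeg = 0.0;                      // stand-in for m_imu.getGyroZDeg()
                    strip = 0;
                }
                int h  = frame.rows / kStrips;
                int y0 = strip * h;
                int y1 = (strip == kStrips - 1) ? frame.rows : y0 + h;
                ++strip;
                return Tile{ frame.rowRange(y0, y1), y0, gyroDeg };
            })
        &
        tbb::make_filter<Tile, Tile>(tbb::filter::parallel, // strips are independent
            [](Tile t) -> Tile {
                cv::GaussianBlur(t.strip, t.strip, cv::Size(3, 3), 0); // per-strip work
                return t;
            })
        &
        tbb::make_filter<Tile, void>(tbb::filter::serial_in_order,
            [](Tile) { /* reassemble or emit the processed strips here */ })
    );
}

Note that a blur at a strip boundary is exactly the kind of spatial dependency that would need the overlap mentioned above; for a pure per-pixel filter it is a non-issue.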

`max_number_of_live_tokens` then takes on a real meaning: it's the number of "tiles" in flight. Its primary purpose is not to limit each filter stage to only one concurrent execution (that doesn't happen anyway), but to keep the maximum working-set size under control.

E.g. if you know that each of your tiles is now 128kB in size, that there are 2 copies involved in each filter (source and destination), and that you have a 2MB L3 cache, then you know you can afford to have 8 tokens in flight without spilling to main memory. If you also happen to have (at least) 8 CPU cores, that yields ideal throughput; even if you don't, at least you are not risking becoming bottlenecked by exceeding the cache size. Of course you can afford some spilling to main memory (beyond what you calculated to be safe), but then you have to profile your system in depth to see whether you are actually getting constrained.
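The same budget as a two-line calculation (the 128kB, 2-copy and 2MB figures are this example's assumptions, not measured values):

#include <cstdio>

int main() {
    const long l3CacheBytes  = 2 * 1024 * 1024; // 2 MB L3
    const long tileBytes     = 128 * 1024;      // 128 kB per tile
    const long copiesPerTile = 2;               // source + destination

    long maxTokens = l3CacheBytes / (tileBytes * copiesPerTile);
    std::printf("max_number_of_live_tokens <= %ld\n", maxTokens); // prints 8
}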

Upvotes: 5
