Display Name

Reputation: 2403

OpenMP and explicit thread interoperability when using OpenCV

I'm using OpenCV in a commercial application and don't have management permission to purchase a TBB license, so I built OpenCV with OpenMP as the parallelism framework.

All the machine vision cameras we use as real-time frame sources have SDKs that fill frame buffers in a circular queue and call user-supplied callbacks to process those frames concurrently, on threads from the SDK's own thread pool.

This works fine when OpenMP is left out of the picture: I do a bunch of (memoryless) processing on individual frames, then serialize them through interthread buffers to feed the stateful processing stage, where frames must be processed in order. If it were just the concurrent frame processing, I wouldn't need OpenMP at all; however, I need to leave it enabled in OpenCV so that the in-order frame processing is accelerated as well.

My concern is how well I can expect OpenMP to work in the first phase, that is, in the concurrently executed callbacks running on threads created explicitly by the camera SDKs. Can I assume the OpenMP runtime is smart enough to use its thread pool efficiently when parallel regions are triggered from multiple externally created threads?

The platform is guaranteed to be x86-64 (VC++15 or GCC).

Upvotes: 0

Views: 619

Answers (2)

Display Name

Reputation: 2403

Here's some example code that explains a workaround I found. LibraryFunction() represents some function I can't modify that already uses OpenMP parallelization, such as something from OpenCV.

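#include <chrono>
#include <cstdlib>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
#include <omp.h>
using namespace std;
void LibraryFunction(int phase, mutex &mtx)
{
    #pragma omp parallel num_threads(3)
    {
        lock_guard<mutex> l{mtx};
        cerr << "Phase: " << phase << "\tTID: " << this_thread::get_id() << "\tOMP: " << omp_get_thread_num() << endl;
    }
}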

The problem of oversubscription with external threads is visible with this:

int main(void)
{
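    // omp_set_dynamic() takes a boolean flag: any nonzero argument enables dynamic adjustment of team sizes.
    omp_set_dynamic(thread::hardware_concurrency());
    omp_set_nested(0); // disable nested parallel regions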
    vector<std::thread> threads;
    threads.reserve(3);
    mutex mtx;
    for (int i = 0; i < 3; ++i)
    {
        threads.emplace_back([&]
        {
            this_thread::sleep_for(chrono::milliseconds(200));
            LibraryFunction(1, mtx);
        });
    }
    for (auto &t : threads) t.join();
    cerr << endl;
    LibraryFunction(2, mtx);
    return EXIT_SUCCESS;
}

The output is:

Phase: 1        TID: 7812       OMP: 0
Phase: 1        TID: 3928       OMP: 0
Phase: 1        TID: 2984       OMP: 0
Phase: 1        TID: 9924       OMP: 1
Phase: 1        TID: 9560       OMP: 2
Phase: 1        TID: 2576       OMP: 1
Phase: 1        TID: 5380       OMP: 2
Phase: 1        TID: 3428       OMP: 1
Phase: 1        TID: 10096      OMP: 2

Phase: 2        TID: 9948       OMP: 0
Phase: 2        TID: 10096      OMP: 1
Phase: 2        TID: 3428       OMP: 2

Phase 1 represents the OpenMP-parallelized library code running in the camera SDK threads, whereas Phase 2 is the same kind of library code used in later pipeline stages and triggered from a single thread. The problem is obvious: each of the three external threads spawns its own team of three OpenMP threads, so nine threads run during Phase 1 and the machine is oversubscribed. Building OpenCV with OpenMP turned off would fix the oversubscription in Phase 1, but would leave Phase 2 with no acceleration.

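The workaround I found is to wrap the call to LibraryFunction() during Phase 1 in #pragma omp parallel sections {}, which suppresses thread creation within that particular call. This presumably works because nested parallelism is disabled via omp_set_nested(0), so the parallel region inside LibraryFunction() becomes a nested region and executes on a single thread. Now the result is: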

Phase: 1        TID: 3168       OMP: 0
Phase: 1        TID: 8888       OMP: 0
Phase: 1        TID: 5712       OMP: 0

Phase: 2        TID: 10232      OMP: 0
Phase: 2        TID: 5012       OMP: 1
Phase: 2        TID: 4224       OMP: 2

I haven't tested this with OpenCV yet, but I expect it to work.
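A minimal sketch of what such a wrapper could look like, offered as an illustration rather than the exact code used above. The names CallLibrarySerialized() and RunFromExternalThreads() are hypothetical, and LibraryFunction() is simplified here to return the size of its team instead of printing. The sketch assumes an OpenMP 3.0 or later runtime, since it uses omp_set_max_active_levels(); on VC++ (OpenMP 2.0), omp_set_nested(0) as in the example above would take its place. The num_threads(2) clause on the outer region is also an addition of this sketch.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Stand-in for unmodifiable library code that opens its own parallel region.
// Returns the number of threads that executed the region.
int LibraryFunction()
{
    std::atomic<int> workers{0};
    #pragma omp parallel num_threads(3)
    {
        ++workers;
    }
    return workers.load();
}

// Hypothetical wrapper for Phase 1 callbacks (the name is illustrative).
int CallLibrarySerialized()
{
#ifdef _OPENMP
    // Allow only one active level of parallelism for regions started here.
    omp_set_max_active_levels(1);
#endif
    int teamSize = 0;
    // The outer region needs more than one thread to count as active, so that
    // the region inside LibraryFunction() is treated as nested and serialized.
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            teamSize = LibraryFunction();
        }
    }
    return teamSize;
}

// Simulates SDK callback threads: each one calls the wrapper once.
std::vector<int> RunFromExternalThreads(int count)
{
    std::vector<int> results(count, 0);
    std::vector<std::thread> threads;
    threads.reserve(count);
    for (int i = 0; i < count; ++i)
        threads.emplace_back([&results, i] { results[i] = CallLibrarySerialized(); });
    for (auto &t : threads) t.join();
    return results;
}
```

Calling RunFromExternalThreads(3) should report a team size of 1 for every callback thread, while a direct call to LibraryFunction() from a single thread can still use its full team. As far as I understand the OpenMP rules, the trick only works while the outer region is active, so each wrapped call still parks one idle helper thread at the region's barrier; that is cheaper than a full extra team but not free.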

Upvotes: 0

bazza

Reputation: 8414

Situation

If I've understood the question properly, the camera library you're using will spawn a number of threads, and each one of those will call your callback function. Inside your callback you want to use OpenMP to accelerate that processing. The results of that are sent through some interthread channel to a pipeline of threads doing more processing.

If that's wrong, please ignore the rest of this answer!

Rest Of Answer

Using OpenMP in the callbacks would seem to be chopping the compute load of this part of your application up into little pieces for not much benefit. The camera library is already overlapping the processing of frames. Using OpenMP here is going to mean that the processing of frames doesn't actually overlap (but the camera library is still using multiple threads as if it does).

If it does still overlap, then logically speaking you haven't got enough cores in your system to keep up with the overall workload anyway (assuming your use of OpenMP results in all cores being maxed out processing a single frame). I'm assuming that your system is successfully keeping up with the flow of frames, and therefore must have enough grunt to do so.

So I think it won't really be a question of whether OpenMP will be intelligent in its use of its thread pool; the thread pool will be dedicated to processing a single frame, and that work will complete before the next frame arrives.

The lack of overlap does mean the latency is lower, which might be what you want. However, you could achieve the same thing if the camera library used a single thread, with your callback using OpenMP (and taking on the responsibility of completing before the next frame arrives). With less thread context switching going on, it would even be a tiny bit quicker. So if you can stop the library from spawning all those threads (maybe there's a config parameter, an environment variable, or some other part of its API), it might be worth it.

Upvotes: 1
