Reputation: 1171

C++: Thread pool slower than single threading?

First of all I did look at the other topics on this website and found they don't relate to my problem as those mostly deal with people using I/O operations or thread creation overheads. My problem is that my threadpool or worker-task structure implementation is (in this case) a lot slower than single threading. I'm really confused by this and not sure if it's the ThreadPool, the task itself, how I test it, the nature of threads or something out of my control.

// Sorry for the long code
#include <vector>
#include <queue>

#include <thread>
#include <mutex>
#include <future>

#include "task.hpp"

class ThreadPool
{
public:
    ThreadPool()
    {
        for (unsigned i = 0; i < std::thread::hardware_concurrency() - 1; i++)
            m_workers.emplace_back(this, i);

        m_running = true;
        for (auto&& worker : m_workers)
            worker.start();
    }
    ~ThreadPool()
    {
        m_running = false;
        m_task_signal.notify_all();
        for (auto&& worker : m_workers)
            worker.terminate();
    }

    void add_task(Task* task)
    {
        {
            std::unique_lock<std::mutex> lock(m_in_mutex);
            m_in.push(task);
        }
        m_task_signal.notify_one();
    }
private:
    class Worker
    {
    public:
        Worker(ThreadPool* parent, unsigned id) : m_parent(parent), m_id(id)
        {}
        ~Worker()
        {
            terminate();
        }

        void start()
        {
            m_thread = new std::thread(&Worker::work, this);
        }
        void terminate()
        {
            if (m_thread)
            {
                if (m_thread->joinable())
                {
                    m_thread->join();
                    delete m_thread;
                    m_thread = nullptr;
                    m_parent = nullptr;
                }
            }
        }
    private:
        void work()
        {
            while (m_parent->m_running)
            {               
                std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
                m_parent->m_task_signal.wait(lock, [&]()
                {
                    return !m_parent->m_in.empty() || !m_parent->m_running;
                });

                if (!m_parent->m_running) break;
                Task* task = m_parent->m_in.front();
                m_parent->m_in.pop();
                // Fixed the mutex being locked while the task is executed
                lock.unlock();

                task->execute();            
            }
        }
    private:
        ThreadPool* m_parent = nullptr;
        unsigned m_id = 0;

        std::thread* m_thread = nullptr;
    };
private:
    std::vector<Worker> m_workers;

    std::mutex m_in_mutex;
    std::condition_variable m_task_signal;
    std::queue<Task*> m_in;

    bool m_running = false;
};

class TestTask : public Task
{
public:
    TestTask() {}
    TestTask(unsigned number) : m_number(number) {}

    inline void Set(unsigned number) { m_number = number; }

    void execute() override
    {
        if (m_number <= 3)
        {
            m_is_prime = m_number > 1;
            return;
        }
        else if (m_number % 2 == 0 || m_number % 3 == 0)
        {
            m_is_prime = false;
            return;
        }
        else
        {
            for (unsigned i = 5; i * i <= m_number; i += 6)
            {
                if (m_number % i == 0 || m_number % (i + 2) == 0)
                {
                    m_is_prime = false;
                    return;
                }
            }
            m_is_prime = true;
            return;
        }
    }
public:
    unsigned m_number = 0;
    bool m_is_prime = false;
};

int main()
{
    ThreadPool pool;

    unsigned num_tasks = 1000000;
    std::vector<TestTask> tasks(num_tasks);
    for (auto&& task : tasks)
        task.Set(randint(0, 1000000000));

    auto s = std::chrono::high_resolution_clock::now();
    #if MT
    for (auto&& task : tasks)
        pool.add_task(&task);
    #else
    for (auto&& task : tasks)
        task.execute();
    #endif
    auto e = std::chrono::high_resolution_clock::now();
    double seconds = std::chrono::duration_cast<std::chrono::nanoseconds>(e - s).count() / 1000000000.0;
}

Benchmarks with VS2013 Profiler:

10,000,000 tasks:
    MT:
        13 seconds of wall clock time
        93.36% is spent in msvcp120.dll
        3.45% is spent in Task::execute() // Not good here
    ST:
        0.5 seconds of wall clock time
        97.31% is spent with Task::execute()

Upvotes: 7

Answers (3)

Craig Estey

Reputation: 33601

For MT, how much time is spent in each phase of the "overhead": std::unique_lock, m_task_signal.wait, front, pop, unlock?

Based on your results of only 3% useful work, this means the above consumes 97%. I'd get numbers for each part of the above (e.g. add timestamps between each call).

It seems to me, that the code you use to [merely] dequeue the next task pointer is quite heavy. I'd do a much simpler queue [possibly lockless] mechanism. Or, perhaps, use atomics to bump an index into the queue instead of the five step process above. For example:

void
work()
{
    while (m_parent->m_running) {
        // NOTE: this is just an example, not necessarily the real function
        int curindex = atomic_increment(&global_index);
        if (curindex >= max_index)
            break;

        Task *task = m_parent->m_in[curindex];

        task->execute();
    }
}

Also, maybe you should pop [say] ten at a time instead of just one.

You might also be memory bound and/or "task switch" bound. (e.g.) For threads that access an array, more than four threads usually saturates the memory bus. You could also have heavy contention for the lock, such that the threads get starved because one thread is monopolizing the lock [indirectly, even with the new unlock call]

Interthread locking usually involves a "serialization" operation where other cores must synchronize their out-of-order execution pipelines.

Here's a "lockless" implementation:

void
work()
{
    // assume m_id is 0,1,2,...
    int curindex = m_id;

    while (m_parent->m_running) {
        if (curindex >= max_index)
            break;

        Task *task = m_parent->m_in[curindex];

        task->execute();

        curindex += NUMBER_OF_WORKERS;
    }
}

Upvotes: 1

ixSci

Reputation: 13698

Usual disclaimer in such answers: the only way to tell for sure is to measure it with a profiler tool.

But I will try to explain your results without it. First of all, you have one mutex across all your threads. So only one thread at a time can execute some task. It kills all your gains you might have. In spite of your threads your code is perfectly serial. So at the very least make your task execution out of the mutex. You need to lock the mutex only to get a task out of the queue — you don't need to hold it when the task gets executed.

Next, your tasks are so simple that single thread will execute them in no time. You just can't measure any gains with such tasks. Create some heavy tasks which could produce some more interesting results(some tasks which are closer to the real world, not such contrived).

And the 3rd point: threads are not without their cost — context switching, mutex contention etc. To have real gains, as the previous 2 points say, you need to have tasks which take more time than the overheads threads introduce and the code should be truly parallel instead of waiting on some resource making it serial.

UPD: I looked at the wrong part of the code. The task is complex enough provided you create tasks with sufficiently large numbers.

UPD2: I've played with your code and found a good prime number to show how the MT code is better. Use the following prime number: 1019048297. It will give enough computation complexity to show the difference.

But why your code doesn't produce good results? It is hard to tell without seeing the implementation of randint() but I take it is pretty simple and in a half of the cases it returns even numbers and other cases produce not much of big prime numbers either. So the tasks are so simple that context switching and other things around your particular implementation and threads in general consume more time than the computation itself. Using the prime number I gave you give the tasks no choice but spend time computing — no easy answer since the number is big and actually prime. That's why the big number will give you the answer you seek — better time for the MT code.

Upvotes: 6

CJCombrink

Reputation: 3950

You should not hold the mutex while the task is getting executed, otherwise other threads will not be able to get a task:

void work() {
    while (m_parent->m_running) {   
        Task* currentTask = nullptr;    
        std::unique_lock<std::mutex> lock(m_parent->m_in_mutex);
        m_parent->m_task_signal.wait(lock, [&]() {
            return !m_parent->m_in.empty() || !m_parent->m_running;
        });                     
        if (!m_parent->m_running) continue;
        currentTask = m_parent->m_in.front();
        m_parent->m_in.pop();               
        lock.unlock(); //<- Release the lock so that other threads can get tasks
        currentTask->execute();
        currentTask = nullptr;
    }   
}

Upvotes: 2

C++: Thread pool slower than single threading?

Answers (3)

Related Questions