user_185051
user_185051

Reputation: 436

Parallel vector multiplication using several threads takes longer than sequential

I have two functions, which do the multiplication of two vectors of integers (filled with all ones for now). I expect the function vector_multiplication_concurrent, which uses threads to be faster than the function vector_multiplication. However, it is actually a bit slower. I suspect that this is because only one thread works on result variable at a time, so the threads do not actually do the job in parallel. Is it correct? How should I change the code to get the parallel function to be faster?

The code:

#include <iostream>
#include <chrono>
#include <vector>
#include <thread>
#include <mutex>

void vector_multiplication(std::vector<int> const & v1,
                           std::vector<int> const & v2,
                           int & result) {

    for (int ind = 0; ind < v1.size(); ++ind) {
        result += v1[ind] * v2[ind];
    }

}

static std::mutex mtx;
void vector_multiplication_concurrent(std::vector<int> const & v1,
                                     std::vector<int> const & v2,
                                     int start_ind, int end_ind,
                                     int & result) {


    std::lock_guard<std::mutex> lck(mtx);

    for (int ind = start_ind; ind <= end_ind; ++ind) {
        result += v1[ind] * v2[ind];
    }

}

int main(){

    std::vector<int> v1 (10000000, 1);
    std::vector<int> v2 (10000000, 1);

    int result = 0;

    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
    vector_multiplication(v1, v2, result);
    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    std::cout << "Duration: " << duration << '\n';
    std::cout << "Product: " << result << '\n';


    int result_concurrent = 0;
    int threads_num = 4;
    std::vector<std::thread> threads;

    std::chrono::high_resolution_clock::time_point t3 = std::chrono::high_resolution_clock::now();

    for (int th = 0; th < threads_num; ++th) {
        threads.push_back(std::thread(vector_multiplication_concurrent,
                                      std::ref(v1),
                                      std::ref(v2),
                                      th * (v1.size() / threads_num),
                                      th * (v1.size() / threads_num) + v1.size() / threads_num - 1,
                                      std::ref(result_concurrent)));
    }
    for (auto & th : threads) {
        th.join();
    }

    std::chrono::high_resolution_clock::time_point t4 = std::chrono::high_resolution_clock::now();

    auto duration_concurrent = std::chrono::duration_cast<std::chrono::microseconds>(t4 - t3).count();
    std::cout << "Duration concurrent: " << duration_concurrent << '\n';
    std::cout << "Product concurrent: " << result_concurrent << '\n';


    return 0;
}

Upvotes: 2

Views: 710

Answers (1)

463035818_is_not_an_ai
463035818_is_not_an_ai

Reputation: 122460

As mentioned in comments you are locking the mutex for the whole duration of the function, hence the code is actually sequential. You only need a mutex if several threads access the same memory and at least one is writing.

In case of summing vector elements you only need to have several threads writing to the same memory when adding the final result, hence you can change the function to:

static std::mutex mtx;
void vector_multiplication_concurrent(std::vector<int> const & v1,
                                     std::vector<int> const & v2,
                                     int start_ind, int end_ind,
                                     int & result) {

    // fully parallel part
    // v1 and v2 are shared, but you are only reading
    int temp = 0;

    for (int ind = start_ind; ind <= end_ind; ++ind) {
        temp += v1[ind] * v2[ind];
    }
    // only this requires you to synchronize access 
    // result is shared and you are writing to it
    std::lock_guard<std::mutex> lck(mtx);
    result += temp;
}

PS: I would strongly suggest you to use iterators instead of indexes. Also note that your loop is basically a rewrite of std::inner_product. Using that instead of the plain loop will make your code more expressive.

Upvotes: 4

Related Questions