Parallel vector multiplication using several threads takes longer than sequential

Question

I have two functions, which do the multiplication of two vectors of integers (filled with all ones for now). I expect the function vector_multiplication_concurrent, which uses threads to be faster than the function vector_multiplication. However, it is actually a bit slower. I suspect that this is because only one thread works on result variable at a time, so the threads do not actually do the job in parallel. Is it correct? How should I change the code to get the parallel function to be faster?

The code:

#include 
#include 
#include 
#include 
#include 

void vector_multiplication(std::vector const & v1,
                           std::vector const & v2,
                           int & result) {

    for (int ind = 0; ind < v1.size(); ++ind) {
        result += v1[ind] * v2[ind];
    }

}

static std::mutex mtx;
void vector_multiplication_concurrent(std::vector const & v1,
                                     std::vector const & v2,
                                     int start_ind, int end_ind,
                                     int & result) {


    std::lock_guard lck(mtx);

    for (int ind = start_ind; ind <= end_ind; ++ind) {
        result += v1[ind] * v2[ind];
    }

}

int main(){

    std::vector v1 (10000000, 1);
    std::vector v2 (10000000, 1);

    int result = 0;

    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
    vector_multiplication(v1, v2, result);
    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast(t2 - t1).count();
    std::cout << "Duration: " << duration << '
';
    std::cout << "Product: " << result << '
';


    int result_concurrent = 0;
    int threads_num = 4;
    std::vector threads;

    std::chrono::high_resolution_clock::time_point t3 = std::chrono::high_resolution_clock::now();

    for (int th = 0; th < threads_num; ++th) {
        threads.push_back(std::thread(vector_multiplication_concurrent,
                                      std::ref(v1),
                                      std::ref(v2),
                                      th * (v1.size() / threads_num),
                                      th * (v1.size() / threads_num) + v1.size() / threads_num - 1,
                                      std::ref(result_concurrent)));
    }
    for (auto & th : threads) {
        th.join();
    }

    std::chrono::high_resolution_clock::time_point t4 = std::chrono::high_resolution_clock::now();

    auto duration_concurrent = std::chrono::duration_cast(t4 - t3).count();
    std::cout << "Duration concurrent: " << duration_concurrent << '
';
    std::cout << "Product concurrent: " << result_concurrent << '
';


    return 0;
}

463035818_is_not_an_ai · Accepted Answer

As mentioned in comments you are locking the mutex for the whole duration of the function, hence the code is actually sequential. You only need a mutex if several threads access the same memory and at least one is writing.

In case of summing vector elements you only need to have several threads writing to the same memory when adding the final result, hence you can change the function to:

static std::mutex mtx;
void vector_multiplication_concurrent(std::vector const & v1,
                                     std::vector const & v2,
                                     int start_ind, int end_ind,
                                     int & result) {

    // fully parallel part
    // v1 and v2 are shared, but you are only reading
    int temp = 0;

    for (int ind = start_ind; ind <= end_ind; ++ind) {
        temp += v1[ind] * v2[ind];
    }
    // only this requires you to synchronize access 
    // result is shared and you are writing to it
    std::lock_guard lck(mtx);
    result += temp;
}

PS: I would strongly suggest you to use iterators instead of indexes. Also note that your loop is basically a rewrite of std::inner_product. Using that instead of the plain loop will make your code more expressive.

Parallel vector multiplication using several threads takes longer than sequential

Answers (1)

Related Questions