minomic
minomic

Reputation: 665

Filling a vector with multiple threads

I need to fill a huge (7734500 elements) std::vector<unsigned int> with random values, and I am trying to do it in parallel with multiple threads to achieve a higher efficiency. Here is the code that I have so far:

std::random_device rd; // seed generator

std::mt19937_64 generator{rd()}; // generator initialized with seed from rd

static const unsigned int NUM_THREADS = 4;


std::uniform_int_distribution<> initialize(unsigned long long int modulus)
{
    std::uniform_int_distribution<> unifDist{0, (int)(modulus-1)};
    return unifDist;
}


void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
    unsigned int end, std::uniform_int_distribution<>& dist)
{
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = dist(generator);
    }
}


std::vector<unsigned int> uniformRandomVector
    (unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
    std::uniform_int_distribution<> dist = initialize(modulus);

    std::thread threads[NUM_THREADS];

    std::vector<unsigned int> v;
    v.resize(rows*columns);

    // number of entries each thread will take care of
    unsigned int positionsEachThread = rows*columns/NUM_THREADS;

    // all but the last thread
    for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
    {
        threads[i] = std::thread(unifRandVectorThreadRoutine, v, i*positionsEachThread,
            (i+1)*positionsEachThread, dist);
        // threads[i].join();
    }

    // last thread
    threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, v,
        (NUM_THREADS-1)*positionsEachThread, rows*columns, dist);
    // threads[NUM_THREADS - 1].join();

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        threads[i].join();
    }

    return v;
}

At the moment, it takes approximately 0.3 seconds: do you think there is a way to make it more efficient?


Edit: Giving each thread its own generator

I have modified the routine as follows

void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
    unsigned int end, std::uniform_int_distribution<>& dist)
{
    std::mt19937_64 generator{rd()};
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = dist(generator);
    }
}

and the running time dropped by one half. So I am still sharing the std::random_device but each thread has its own std::mt19937_64.


Edit: Giving each thread its own vector and then concatenating

I changed the code as follows:

void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int length,
    std::uniform_int_distribution<>& dist)
{
    vector.reserve(length);
    std::mt19937_64 generator{rd()};
    for(unsigned int i = 0 ; i < length ; ++i)
    {
        vector.push_back(dist(generator));
    }
}

and

std::vector<unsigned int> uniformRandomVector
    (unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
    std::uniform_int_distribution<> dist = initialize(modulus);

    std::thread threads[NUM_THREADS];

    std::vector<unsigned int> v[NUM_THREADS];

    unsigned int positionsEachThread = rows*columns/NUM_THREADS;

    // all but the last thread
    for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
    {
        threads[i] = std::thread(unifRandVectorThreadRoutine, std::ref(v[i]), positionsEachThread, dist);
    }

    // last thread
    threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, std::ref(v[NUM_THREADS-1]),
        rows*columns - (NUM_THREADS-1)*positionsEachThread, dist);

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        threads[i].join();
    }

    std::vector<unsigned int> finalVector;
    finalVector.reserve(rows*columns);

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        finalVector.insert(finalVector.end(), v[i].begin(), v[i].end());
    }

    return finalVector;
}

The execution time is slightly worse than before, when I was using just one vector shared between all the threads. Am I missing something or can it just happen?


Edit: using a different PRNG + benchmarks

Using a different PRNG (as suggested in some comments/answers) helps a lot: I tried with the xorshift+ and here is the implementation I am using:

class xorShift128PlusGenerator
{
public:
    xorShift128PlusGenerator()
    {
        state[0] = rd();
        state[1] = rd();
    };


    unsigned long int next()
    {
        unsigned long int x = state[0];
        unsigned long int const y = state[1];
        state[0] = y;
        x ^= x << 23; // a
        state[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
        return state[1] + y;
    }


private:
    std::random_device rd; // seed generator
    unsigned long int state[2];

};

Then the routine is as follows

void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
    unsigned int end)
{
    xorShift128PlusGenerator prng;
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = prng.next();
    }
}

Since I am now home and I am using a different (and more powerful) machine, I redid the tests to compare the results. Here is what I obtain:

Note: the execution time varies at each repetition. These are just typical values.

So there seems to be no difference if the xorshift generator is shared or not, but with all these improvements the execution time has dropped significantly.

Upvotes: 9

Views: 5884

Answers (3)

Niall
Niall

Reputation: 30606

The generator std::mt19937_64 generator{rd()}; is shared between threads. There will be some shared state that requires updating in it, hence contention; there is a data race. You should also look to allow each thread to use its own generator - you will just need to make sure that they generate separate sequences.

You possibly have a cache contention issue around std::vector<unsigned int> v;, it is declared outside the threads and then hit with each iteration of the for loop in each thread. Let each thread have its own vector to fill, once all the threads are done, collate their results in the vector v. Possibly via a std::future will be the quickest. The exact size of the contention depends on the cache line sizes and the size of the vector being used (and segmented).

In this case you fill a large number of elements (7734500) with a comparatively small number of threads (4), the ratio would possibly lead to fewer contentions.

W.r.t. the number threads you could be using, you should consider binding the NUM_THREADS to the hardware concurrency available on the target; i.e. std::thread::hardware_concurrency().

When dealing with this large a number of elements, you could also look to avoid unnecessary initialisations and moving of the results (albeit given the int type, the move is less not noticeable here). The container itself is also something to be aware of; vector requires contiguous memory, so any additional elements (during a coalition phase) could result in memory allocation and copying.

The speed of the random number generator may also have an impact, other implementations and/or algorithms may impact the final execution times significantly enough to be considered.

As always with all performance based questions - the final solution requires measurement. Implement the possible solutions, measure over the target processors and environments and adapt until a suitable performance is found.

Upvotes: 8

David Haim
David Haim

Reputation: 26476

  std::vector<unsigned int> v;
    v.resize(rows*columns);

Unfortunatly, std::vector::resize value-intialize primitives as well, making your program once write zeros over the vector memory, then overriding this value with the random numbers.

try std::vector::reserve + std::vector::push_back.
that means the the threads can no longer share the vector without a lock, but you can give each one it's own vector, use reserve+push_back then combine all the results to bigger vector.

if that is not enough, and I hate to say that , use std::unique_ptr with malloc (with costume deleter). yes, this is C, yes this is nasty, yes, we have new[] , but malloc won't zero initalize the memory (unlike new[] and stl containers), then you can spread segments of the memory to each threads and let it generate the random number on it. you will save combining the vectors to one combined vector.

Upvotes: 0

Daniel Langr
Daniel Langr

Reputation: 23497

The Mersenne Twister generator (std::mt19937_64) is not too fast. You might consider other generators such as Xorshift+. See, e.g., this question: What is performance-wise the best way to generate random bools? (the discussion there goes beyond just bools).

And you should get rid of the data race in your code. Use one generator per thread.

Upvotes: 3

Related Questions