Reputation: 665
I need to fill a huge (7734500 elements) std::vector<unsigned int>
with random values, and I am trying to do it in parallel with multiple threads to achieve higher efficiency. Here is the code that I have so far:
std::random_device rd; // seed generator
std::mt19937_64 generator{rd()}; // generator initialized with seed from rd

static const unsigned int NUM_THREADS = 4;

std::uniform_int_distribution<> initialize(unsigned long long int modulus)
{
    std::uniform_int_distribution<> unifDist{0, (int)(modulus-1)};
    return unifDist;
}
void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
     unsigned int end, std::uniform_int_distribution<>& dist)
{
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = dist(generator);
    }
}
std::vector<unsigned int> uniformRandomVector
    (unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
    std::uniform_int_distribution<> dist = initialize(modulus);
    std::thread threads[NUM_THREADS];

    std::vector<unsigned int> v;
    v.resize(rows*columns);

    // number of entries each thread will take care of
    unsigned int positionsEachThread = rows*columns/NUM_THREADS;

    // all but the last thread
    for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
    {
        threads[i] = std::thread(unifRandVectorThreadRoutine, std::ref(v), i*positionsEachThread,
                                 (i+1)*positionsEachThread, std::ref(dist));
        // threads[i].join();
    }
    // last thread
    threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, std::ref(v),
                                           (NUM_THREADS-1)*positionsEachThread, rows*columns, std::ref(dist));
    // threads[NUM_THREADS - 1].join();

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        threads[i].join();
    }
    return v;
}
At the moment, it takes approximately 0.3 seconds: do you think there is a way to make it more efficient?
Edit: Giving each thread its own generator
I have modified the routine as follows
void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
     unsigned int end, std::uniform_int_distribution<>& dist)
{
    std::mt19937_64 generator{rd()};
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = dist(generator);
    }
}
and the running time dropped by one half. So I am still sharing the std::random_device, but each thread has its own std::mt19937_64.
Edit: Giving each thread its own vector and then concatenating
I changed the code as follows:
void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int length,
     std::uniform_int_distribution<>& dist)
{
    vector.reserve(length);
    std::mt19937_64 generator{rd()};
    for(unsigned int i = 0 ; i < length ; ++i)
    {
        vector.push_back(dist(generator));
    }
}
and
std::vector<unsigned int> uniformRandomVector
    (unsigned int rows, unsigned int columns, unsigned long long int modulus)
{
    std::uniform_int_distribution<> dist = initialize(modulus);
    std::thread threads[NUM_THREADS];
    std::vector<unsigned int> v[NUM_THREADS];

    unsigned int positionsEachThread = rows*columns/NUM_THREADS;

    // all but the last thread
    for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
    {
        threads[i] = std::thread(unifRandVectorThreadRoutine, std::ref(v[i]),
                                 positionsEachThread, std::ref(dist));
    }
    // last thread
    threads[NUM_THREADS - 1] = std::thread(unifRandVectorThreadRoutine, std::ref(v[NUM_THREADS-1]),
                                           rows*columns - (NUM_THREADS-1)*positionsEachThread, std::ref(dist));

    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        threads[i].join();
    }

    std::vector<unsigned int> finalVector;
    finalVector.reserve(rows*columns);
    for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
    {
        finalVector.insert(finalVector.end(), v[i].begin(), v[i].end());
    }
    return finalVector;
}
The execution time is slightly worse than before, when I was using just one vector shared between all the threads. Am I missing something or can it just happen?
Edit: using a different PRNG + benchmarks
Using a different PRNG (as suggested in some comments/answers) helps a lot: I tried with the xorshift+
and here is the implementation I am using:
class xorShift128PlusGenerator
{
public:
    xorShift128PlusGenerator()
    {
        state[0] = rd();
        state[1] = rd();
    }

    unsigned long int next()
    {
        unsigned long int x = state[0];
        unsigned long int const y = state[1];
        state[0] = y;
        x ^= x << 23; // a
        state[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
        return state[1] + y;
    }

private:
    std::random_device rd;        // seed generator
    unsigned long int state[2];   // note: assumes unsigned long int is 64 bits on this platform
};
Then the routine is as follows
void unifRandVectorThreadRoutine
    (std::vector<unsigned int>& vector, unsigned int start,
     unsigned int end)
{
    xorShift128PlusGenerator prng;
    for(unsigned int i = start ; i < end ; ++i)
    {
        vector[i] = prng.next();
    }
}
Since I am now home and I am using a different (and more powerful) machine, I redid the tests to compare the results. Here is what I obtain:
Note: the execution time varies at each repetition. These are just typical values.
So there seems to be no difference if the xorshift generator is shared or not, but with all these improvements the execution time has dropped significantly.
Upvotes: 9
Views: 5884
Reputation: 30606
The generator std::mt19937_64 generator{rd()}; is shared between the threads. It has internal state that must be updated on every call, hence contention between the threads; and since that state is modified without synchronisation, there is also a data race. You should look to give each thread its own generator - you just need to make sure the generators produce separate sequences.
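A minimal sketch of one way to get separate sequences (the helper name makeSeededEngine is illustrative only, not from the original code); each thread would call it once with its own index:

    #include <random>

    // Illustrative helper: build one independently seeded engine per thread.
    // Mixing std::random_device output with the thread index via std::seed_seq
    // makes it very unlikely that two threads produce the same sequence.
    std::mt19937_64 makeSeededEngine(unsigned int threadIndex)
    {
        std::random_device rd;
        std::seed_seq seq{rd(), rd(), rd(), rd(), threadIndex};
        return std::mt19937_64{seq};
    }

Each thread then constructs its own engine, so no two threads share generator state and their sequences are (with overwhelming probability) distinct.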
You possibly have a cache contention issue around std::vector<unsigned int> v; it is declared outside the threads and then written to on each iteration of the for loop in each thread. Let each thread have its own vector to fill, and once all the threads are done, collate their results into the vector v. Doing this via std::future will possibly be the quickest. The exact extent of the contention depends on the cache line size and on the size of the vector being used (and how it is segmented).
In this case you fill a large number of elements (7734500) with a comparatively small number of threads (4), so this ratio would likely lead to fewer contentions.
W.r.t. the number of threads you could be using, you should consider binding NUM_THREADS to the hardware concurrency available on the target, i.e. std::thread::hardware_concurrency().
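Putting those points together, here is a sketch of one possible shape, using std::async so each task returns its own buffer through a std::future (fillChunk and uniformRandomVectorAsync are illustrative names, not the poster's functions):

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <random>
    #include <thread>
    #include <vector>

    // Each task owns its engine and its buffer; nothing is shared between threads.
    std::vector<unsigned int> fillChunk(std::size_t count, unsigned int modulus, unsigned int seed)
    {
        std::mt19937_64 engine{seed};
        std::uniform_int_distribution<unsigned int> dist{0, modulus - 1};
        std::vector<unsigned int> chunk;
        chunk.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            chunk.push_back(dist(engine));
        return chunk;
    }

    std::vector<unsigned int> uniformRandomVectorAsync(std::size_t total, unsigned int modulus)
    {
        const unsigned int numThreads = std::max(1u, std::thread::hardware_concurrency());
        std::random_device rd;

        std::vector<std::future<std::vector<unsigned int>>> futures;
        std::size_t perThread = total / numThreads;
        for (unsigned int t = 0; t < numThreads; ++t)
        {
            // The last task also takes the remainder.
            std::size_t count = (t == numThreads - 1) ? total - t * perThread : perThread;
            futures.push_back(std::async(std::launch::async, fillChunk, count, modulus, rd()));
        }

        std::vector<unsigned int> result;
        result.reserve(total);
        for (auto& f : futures)
        {
            auto chunk = f.get();
            result.insert(result.end(), chunk.begin(), chunk.end());
        }
        return result;
    }

Because each task owns both its engine and its vector, there is no shared mutable state; the only serial work left is the final insert into result.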
When dealing with this large a number of elements, you could also look to avoid unnecessary initialisations and moves of the results (albeit, given the int type, the moves are less noticeable here). The container itself is also something to be aware of; vector requires contiguous memory, so any additional elements (during a collation phase) could result in memory allocation and copying.
The speed of the random number generator may also have an impact, other implementations and/or algorithms may impact the final execution times significantly enough to be considered.
As always with all performance based questions - the final solution requires measurement. Implement the possible solutions, measure over the target processors and environments and adapt until a suitable performance is found.
Upvotes: 8
Reputation: 26476
std::vector<unsigned int> v;
v.resize(rows*columns);
Unfortunately, std::vector::resize value-initializes primitives as well, so your program first writes zeros over the vector's memory and then overwrites those zeros with the random numbers. Try std::vector::reserve + std::vector::push_back instead.
That means the threads can no longer share the vector without a lock, but you can give each one its own vector, use reserve + push_back, and then combine all the results into a bigger vector.
If that is not enough, and I hate to say this, use std::unique_ptr with malloc (with a custom deleter). Yes, this is C, yes, this is nasty, yes, we have new[], but malloc won't zero-initialize the memory (unlike the STL containers); you can then hand a segment of the memory to each thread and let it generate its random numbers in place. You also save combining the vectors into one combined vector.
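A minimal sketch of that idea, assuming an unsigned int buffer (the names FreeDeleter and makeUninitializedBuffer are just for illustration):

    #include <cstddef>
    #include <cstdlib>
    #include <memory>
    #include <new>

    // Custom deleter so unique_ptr releases malloc'd memory with free().
    struct FreeDeleter
    {
        void operator()(unsigned int* p) const { std::free(p); }
    };

    // Allocates storage without zero-initializing it (unlike vector::resize).
    std::unique_ptr<unsigned int[], FreeDeleter> makeUninitializedBuffer(std::size_t count)
    {
        auto* raw = static_cast<unsigned int*>(std::malloc(count * sizeof(unsigned int)));
        if (raw == nullptr)
            throw std::bad_alloc{};
        return std::unique_ptr<unsigned int[], FreeDeleter>{raw};
    }

Each thread would then write into its own [start, end) segment of the buffer with no locking, and there is no collation step afterwards.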
Upvotes: 0
Reputation: 23497
The Mersenne Twister generator (std::mt19937_64
) is not too fast. You might consider other generators such as Xorshift+. See, e.g., this question: What is performance-wise the best way to generate random bools? (the discussion there goes beyond just bools).
And you should get rid of the data race in your code. Use one generator per thread.
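For example, one could wrap the xorshift128+ from the question in the standard UniformRandomBitGenerator interface and give each thread its own instance; a sketch under those assumptions (XorShift128Plus is an illustrative name):

    #include <cstdint>
    #include <random>

    // Sketch: xorshift128+ exposed with the UniformRandomBitGenerator interface,
    // so it can be plugged straight into std::uniform_int_distribution.
    struct XorShift128Plus
    {
        using result_type = std::uint64_t;

        // xorshift128+ requires that the state is not all zeros.
        explicit XorShift128Plus(std::uint64_t seed1, std::uint64_t seed2) : s{seed1, seed2} {}

        static constexpr result_type min() { return 0; }
        static constexpr result_type max() { return UINT64_MAX; }

        result_type operator()()
        {
            std::uint64_t x = s[0];
            const std::uint64_t y = s[1];
            s[0] = y;
            x ^= x << 23;
            s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);
            return s[1] + y;
        }

    private:
        std::uint64_t s[2];
    };

    // One engine per thread, seeded independently.
    thread_local XorShift128Plus engine{std::random_device{}(), std::random_device{}()};

A std::uniform_int_distribution<unsigned int> can then draw from engine exactly as the original code drew from std::mt19937_64, with no state shared between threads.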
Upvotes: 3