Reputation: 1212
What is the best way to sort and concatenate million or billion STL vectors into a single STL vector. Presently, the way I am doing it is to just iterate over the vectors and perform each operation.
Here is the pseudo code
typedef unsigned long long int ULLInt;
ULLInt N = 1000000;
vector<vector<ULLInt> > vecVec( N, vector<ULLInt>() );
vector<ULLInt> concatVec;
// ...
// ... fill vectors inside vecVec here
// .. we also get here the total number of values inserted in all vectors (count)
// ...
// reserve the space
concatVec.reserve( count);
// sort each vector and concatenate them in sequence
for( ULLInt i=0; i<N; i++)
sort( vecVec[i].begin(), vecVec[i].end() );
concatVec.insert( concatVec.end(), vecVec[i].begin(), vecVec[i].end() );
end for
Note that there is no requirement for concatVec to be sorted. Thanks for suggestions.
Upvotes: 4
Views: 1534
Reputation: 48021
One thing I'd do is ask if you really need to concatentate a million std::vectors. What if you added each vector to a list and made your own iterator that would traverse each element in each vector? For most algorithms, that would be indistinguishable from one massive vector. And, depending on the load, the extra work done in the custom iterator would be far less than all the work needed to actually concatenate all the vectors.
Upvotes: 3
Reputation: 789
If vectors in vecVec are filled in ascending order(as I understand from chat - that is the your use case), then you may try to use one vector instead of many smalls, maintaining begin index of each vector in separate indices array. That will avoid expensive concatination, by "constructing" sub-vectors in place.
#include <vector>
#include <algorithm>
#include <cstdlib>
#include <iterator>
int main(int argc,char *argv[])
{
using namespace std;
typedef int Payload;
vector<Payload> vec;
vector<size_t> indices;
for(unsigned i=0;i!=100;++i)
{
indices.push_back(vec.size()); // begin index of current vector
// filling current sub-vector
generate_n(back_inserter(vec),777+i,rand);
}
indices.push_back(vec.size()); // end of last vector, or begin of
// one-past last vector
// sorting each sub vector
for(size_t i=0;i+1!=indices.size();++i)
{
// can be done in parallel. be aware of possible false sharing
sort(vec.begin()+indices[i],vec.begin()+indices[i+1]);
}
return 0;
}
Upvotes: 1
Reputation: 6155
How about this:
some quick code (prob. won't compile but you might get the point):
typedef vector<vector<ULLINT>> ManyVectors;
void merge(ManyVectors vector_of_vectors) {
const int cores = 16;
std::array<ManyVectors, cores> piles = split_vector(vector_of_vectors,cores);
std::array<size_t, cores> sizes = calculate_sizes(piles,cores);
std::vector<ULLINT> result;
result.reserve(sum_of_sizes(sizes));
int used = 0;
int core = 0;
for (ManyVectors& pile: piles) {
std::thread(merge_vectors, pile, result.begin()+used);
used += sizes[core];
core += 1;
}
}
Upvotes: 1
Reputation: 76448
Each time the code inserts the contents of one of the vectors it has to make sure the target vector has enough memory to hold the result. This means that it will frequently reallocate the memory for the target vector. That means copying its contents, and the code ends up doing that many, many times. It will be much faster to pre-allocate the memory for the target vector to the eventual full size. Read about vector::reserve()
.
Upvotes: 1