Reputation: 2520
I have 15,000,000 std::vectors of 6 integers each.
Those 15M vectors contain duplicates. Duplicate example:
(4,3,2,0,4,23)
(4,3,2,0,4,23)
I need to obtain a list of the unique sequences with their associated counts. (A sequence that is present only once would have a count of 1.)
Is there an algorithm in standard C++ (C++11 is fine) that does that in one shot?
Windows, 4GB RAM, 30+GB hdd
Upvotes: 1
Views: 3673
Reputation: 45705
There is no algorithm in the standard library that does exactly this, but it's very easy with a single loop and the proper data structure.
For this you want to use std::unordered_map
which is typically implemented as a hash map. It has expected constant time per access (insert and look-up) and is thus the first choice for huge data sets.
The following access-and-increment trick automatically inserts a new entry into the counter map if it's not yet there; it then increments the count and writes it back.
typedef std::vector<int> VectorType; // Please consider std::array<int, 6>!

// Caveat: the standard library does not specialize std::hash for
// std::vector<int>, so a custom hash functor must be supplied as the
// third template argument of the unordered_map.
std::unordered_map<VectorType, int> counters;
for (const VectorType& vec : vectors) { // const reference avoids copying each vector
    counters[vec]++;
}
For further processing, you most probably want to sort the entries by the number of occurrences. For this, either copy them into a vector of pairs (each pair holding the number vector and its occurrence count) and sort that, or into an (ordered) multimap with key and value swapped, so it is automatically ordered by the counter (a plain map would lose entries that share the same count).
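A sketch of the copy-and-sort variant; a `std::map` is used here only so the example is self-contained (a `std::vector<int>` key works out of the box because it has `operator<`), and `sortByCount` is an illustrative name. The same copy-and-sort works unchanged for an unordered_map:

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

typedef std::vector<int> VectorType;
typedef std::pair<VectorType, int> Entry;

// Copy the (vector, count) pairs out of the counter map and sort them
// by descending occurrence count.
std::vector<Entry> sortByCount(const std::map<VectorType, int>& counters) {
    std::vector<Entry> entries(counters.begin(), counters.end());
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.second > b.second; });
    return entries;
}
```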
In order to reduce the memory footprint of this solution, try this:
If you don't need to get the keys back from this hash map, you can use a hash map that stores only the hashes of the keys instead of the keys themselves. For that, use std::size_t as the key type, an identity function as the internal hash function, and access the map with a manual call to a hash function for VectorType. Note that std::identity is not part of C++11, and std::hash is not specialized for std::vector<int> either, so both functors have to be written by hand:

// Hand-written replacements for the missing standard functors:
struct IdentityHash {
    std::size_t operator()(std::size_t h) const { return h; }
};
std::size_t vectorHash(const VectorType& vec); // e.g. combine std::hash<int> of each element

std::unordered_map<std::size_t, int, IdentityHash> counters;
for (const VectorType& vec : vectors) {
    counters[vectorHash(vec)]++;
}
This reduces memory but requires additional effort to interpret the results: you have to loop over the original data structure a second time in order to find the original vectors (and look them up in your hash map by hashing them again).
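A sketch of that second pass, assuming hand-written `IdentityHash` and `vectorHash` helpers (illustrative names; C++11 provides neither `std::identity` nor a `std::hash` specialization for `std::vector<int>`):

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::vector<int> VectorType;

// Identity "hash": the key already is a hash value, so pass it through.
struct IdentityHash {
    std::size_t operator()(std::size_t h) const { return h; }
};

// Stand-in for the missing std::hash<std::vector<int>>.
std::size_t vectorHash(const VectorType& v) {
    std::size_t seed = v.size();
    for (int x : v)
        seed ^= std::hash<int>()(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
}

typedef std::unordered_map<std::size_t, int, IdentityHash> HashCountMap;

// Second pass over the original data: the map only knows hashes, so we
// re-hash each vector, emit it with its count the first time its hash
// is seen, and erase the entry so later duplicates are skipped.
std::vector<std::pair<VectorType, int> >
recoverVectors(const std::vector<VectorType>& vectors, HashCountMap counters) {
    std::vector<std::pair<VectorType, int> > result;
    for (const VectorType& vec : vectors) {
        HashCountMap::iterator it = counters.find(vectorHash(vec));
        if (it != counters.end()) {
            result.push_back(std::make_pair(vec, it->second));
            counters.erase(it);
        }
    }
    return result;
}
```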
Upvotes: 9
Reputation: 146
You could convert each vector to a string, one by one, like "4,3,2,0,4,23". Then add the strings to a new string vector, checking for existence with the find() function.
If you need the original vectors, convert the string vector back into integer-sequence vectors. If you don't, simply drop the duplicated elements while building the string vector.
Upvotes: -1
Reputation: 137870
Yes: first std::sort the list (std::vector uses lexicographic ordering; the first element is the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again, but with an inverted comparator, to find the first non-duplicate.
Alternatively, you could use std::unique with a custom comparator that flags when a duplicate is found and maintains a count through the successive calls. This also gives you a deduplicated list.
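A sketch of the sort-then-count idea; it uses a plain run-length scan over the sorted data rather than a stateful comparator (whose order of invocation std::unique does not guarantee), and `countBySorting` is an illustrative name:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::vector<int> VectorType;

// Sort-then-scan: the lexicographic sort groups duplicates together,
// then a single linear pass counts each run of equal vectors.
std::vector<std::pair<VectorType, int> >
countBySorting(std::vector<VectorType> vectors) {  // by value: sorted in the copy
    std::sort(vectors.begin(), vectors.end());
    std::vector<std::pair<VectorType, int> > result;
    for (std::size_t i = 0; i < vectors.size(); ++i) {
        if (result.empty() || result.back().first != vectors[i])
            result.push_back(std::make_pair(vectors[i], 1));  // new run starts
        else
            ++result.back().second;                           // same run continues
    }
    return result;
}
```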
The advantage of these approaches over std::unordered_map is that the extra space required is proportional to the number of duplicates found: you don't have to copy the entire original dataset or add a seldom-used count field for every entry.
Upvotes: 3