Lazik

Reputation: 2520

Efficient frequency counter

I have 15,000,000 std::vectors of 6 integers each.

Those 15M vectors contain duplicates. Duplicate example:

(4,3,2,0,4,23)
(4,3,2,0,4,23)

I need to obtain a list of the unique sequences with their associated counts. (A sequence that is present only once would have a count of 1.)

Is there an algorithm in standard C++ (C++11 is fine) that does that in one shot?

Windows, 4GB RAM, 30+GB hdd

Upvotes: 1

Views: 3673

Answers (3)

leemes

Reputation: 45705

There is no algorithm in the standard library which does exactly this; however, it's very easy with a single loop and the proper data structure.

For this you want to use std::unordered_map, which is typically implemented as a hash map. It has expected constant time per access (insert and look-up) and is thus the first choice for huge data sets.

The following access-and-increment trick automatically inserts a new entry into the counter map if it's not yet there (value-initialized to zero); it then increments the count in place.

typedef std::vector<int> VectorType;        // Please consider std::array<int,6>!

// std::hash has no specialization for std::vector<int>; supply a hash functor:
struct VectorHash {
    std::size_t operator()(const VectorType& v) const {
        std::size_t seed = v.size();
        for (int i : v)
            seed ^= std::hash<int>()(i) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};
std::unordered_map<VectorType, int, VectorHash> counters;

for (const VectorType& vec : vectors)
    counters[vec]++;

For further processing, you most probably want to sort the entries by the number of occurrence. For this, either write them out in a vector of pairs (which encapsulates the number vector and the occurrence count), or in an (ordered) map which has key and value swapped, so it's automatically ordered by the counter.

In order to reduce the memory footprint of this solution, try this:

If you don't need to get the keys back out of this hash map, you can use a hash map which doesn't store the keys but only their hashes. For this, use std::size_t as the key type and an identity functor as the map's internal hash function (note that std::identity is not part of C++11, so write a trivial pass-through functor), and compute each vector's hash manually with the custom hash functor before the look-up.

// The keys are already hashes, so the map's own hash is a pass-through:
struct IdentityHash {
    std::size_t operator()(std::size_t h) const { return h; }
};

std::unordered_map<std::size_t, int, IdentityHash> counters;
VectorHash hashFunc;   // the custom hash functor from above

for (const VectorType& vec : vectors)
    counters[hashFunc(vec)]++;

This reduces memory, but requires additional effort to interpret the results: you have to loop over the original data structure a second time to find the vectors behind each hash (hashing each one again and looking it up in the map).

Upvotes: 9

karakale

Reputation: 146

You could convert each vector to a string, element by element, like this: "4,3,2,0,4,23". Then add each string to a new string vector, first checking for its existence with the find() function.

If you need the original vectors afterwards, convert the string vector back into a vector of integer sequences; if you do not need them, just drop the duplicated elements while building the string vector.

Upvotes: -1

Potatoswatter

Reputation: 137870

Yes: first std::sort the list (std::vector's operator< compares lexicographically, so the first element is the most significant), then loop with std::adjacent_find to find duplicates. When a duplicate is found, use std::adjacent_find again but with an inverted comparator to find the first non-duplicate.

Alternately, you could use std::unique with a custom comparator that flags when a duplicate is found, and maintains a count through the successive calls. This also gives you a deduplicated list.

The advantage of these approaches over std::unordered_map is space complexity proportional to the number of duplicates. You don't have to copy the entire original dataset or add a seldom-used field for dup-count.

Upvotes: 3
