Reputation: 5476
This question is actually quite simple, yet I would like to hear some ideas before jumping into coding. Given a file with a word on each line, find the n most frequent words.
The first, and unfortunately only, thing that pops into my mind is to use a std::map. I know fellow C++'ers will say that unordered_map would be much more reasonable.
I would like to know whether anything can be added on the algorithm side, or whether this is basically a 'whoever picks the best data structure wins' type of question. I've searched the internet and read that a hash table and a priority queue might provide an algorithm with O(n) running time; however, I assume that would be too complex to implement.
Any ideas?
Upvotes: 6
Views: 5020
Reputation: 3011
If you are just interested in the top N most frequent words, and you don't need the result to be exact, then there is a very clever structure you can use. I heard of it by way of Udi Manber; it works as follows:
You create an array of N elements, where each element tracks a value and a count, and you keep a counter that indexes into this array. Additionally, you have a map from value to index into that array. Every time you update the structure with a value (like a word from a stream of text), you first check the map to see whether that value is already in the array; if it is, you increment the count for that value. If it is not, and the element the counter points at has a count of zero, the new value claims that slot with a count of one; otherwise you decrement that element's count (evicting it from the map if the count reaches zero). In either case you then advance the counter to the next slot.
This sounds simple, and nothing about the algorithm suggests that it will yield anything useful, but for typical real data it tends to do very well. Normally, if you wish to track the top N things, you might want to build this structure with a capacity of 10*N, since there will be a lot of empty values in it. Using the King James Bible as input, here is what this structure lists as the most frequent words (in no particular order):
0 : in
1 : And
2 : shall
3 : of
4 : that
5 : to
6 : he
7 : and
8 : the
9 : I
And here are the top ten most frequent words (in order):
0 : the , 62600
1 : and , 37820
2 : of , 34513
3 : to , 13497
4 : And , 12703
5 : in , 12216
6 : that , 11699
7 : he , 9447
8 : shall , 9335
9 : unto , 8912
You can see that it got 9 of the top 10 words correct, and it did so using space for only 50 elements. Depending on your use case, the savings in space may be very useful. It is also very fast.
Here is the implementation of topN that I used, written in Go:
type Event string

type TopN struct {
	events  []Event
	counts  []int
	current int           // index of the slot considered for eviction
	mapped  map[Event]int // event -> index into events/counts
}

func makeTopN(N int) *TopN {
	return &TopN{
		counts:  make([]int, N),
		events:  make([]Event, N),
		current: 0,
		mapped:  make(map[Event]int, N),
	}
}

func (t *TopN) RegisterEvent(e Event) {
	if index, ok := t.mapped[e]; ok {
		// Already tracked: just bump its count.
		t.counts[index]++
	} else {
		if t.counts[t.current] == 0 {
			// Free slot: the new event claims it.
			t.counts[t.current] = 1
			t.events[t.current] = e
			t.mapped[e] = t.current
		} else {
			// Occupied slot: decay its count, evicting the
			// tracked event if the count reaches zero.
			t.counts[t.current]--
			if t.counts[t.current] == 0 {
				delete(t.mapped, t.events[t.current])
			}
		}
	}
	t.current = (t.current + 1) % len(t.counts)
}
Upvotes: 1
Reputation: 68688
The best data structure to use for this task is a Trie:
http://en.wikipedia.org/wiki/Trie
It will outperform a hash table for counting strings.
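To make the idea concrete, here is a minimal sketch of a trie-based word counter in C++. This is my own illustration, not code from the answer or any library, and it assumes words consist only of lowercase a-z:

#include <array>
#include <iostream>
#include <memory>
#include <string>

// One node per character; 'count' is the number of words ending here.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> children;
    unsigned count = 0;
};

// Walk the trie, creating nodes as needed, and bump the final count.
// Assumes every character is in the range 'a'..'z'.
void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c - 'a'];
        if (!child)
            child = std::make_unique<TrieNode>();
        node = child.get();
    }
    ++node->count;
}

int main() {
    TrieNode root;
    for (std::string word; std::cin >> word; )
        insert(root, word);
    // A depth-first walk over the trie would then recover each word
    // and its count, from which the top n can be selected.
}

Common prefixes are stored once, which is where the space (and cache) advantage over a hash table can come from on natural-language input.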
Upvotes: 6
Reputation: 490238
I would not start with std::map (or unordered_map) if I had much choice (though I don't know what other constraints may apply).
You have two data items here, and you use one as the key part of the time, but the other as the key another part of the time. For that, you probably want something like a Boost Bimap or possibly Boost MultiIndex.
Here's the general idea using Bimap:
#include <boost/bimap.hpp>
#include <boost/bimap/list_of.hpp>
#include <iostream>

#define elements(array) ((sizeof(array)/sizeof(array[0])))

// Wrapper around unsigned that default-constructs to zero, so a count
// stored in the bimap can be incremented in place.
class uint_proxy {
    unsigned value;
public:
    uint_proxy() : value(0) {}
    uint_proxy& operator++() { ++value; return *this; }
    unsigned operator++(int) { return value++; }
    operator unsigned() const { return value; }
};

int main() {
    int b[] = {2, 4, 3, 5, 2, 6, 6, 3, 6, 4};
    boost::bimap<int, boost::bimaps::list_of<uint_proxy> > a;

    // walk through array, counting how often each number occurs:
    for (int i = 0; i < elements(b); i++)
        ++a.left[b[i]];

    // print out the most frequent:
    std::cout << a.right.rbegin()->second;
}
For the moment, I've only printed out the most frequent number, but iterating N times to print out the N most frequent is pretty trivial.
Upvotes: 1
Reputation: 47750
Given a file with a word on each line, find the n most frequent words. ... I've searched the internet and read that a hash table and a priority queue might provide an algorithm with O(n)
If you meant that the two n's are the same, then no, this is not possible. However, if you just meant time linear in the size of the input file, then a trivial implementation with a hash table will do what you want.
There might be probabilistic approximate algorithms with sublinear memory.
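For illustration, a minimal sketch of that trivial hash-table implementation. The file name is a placeholder, and the choice of std::nth_element for the selection step (linear in the number of distinct words, on average) is my own:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::ifstream file("words.txt");  // hypothetical input, one word per line
    std::unordered_map<std::string, unsigned> counts;
    for (std::string word; std::getline(file, word); )
        ++counts[word];               // O(1) amortized per word

    // Copy the (word, count) pairs out and select the top n by count.
    std::vector<std::pair<std::string, unsigned>> v(counts.begin(), counts.end());
    std::size_t n = std::min<std::size_t>(10, v.size());
    std::nth_element(v.begin(), v.begin() + n, v.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });

    for (std::size_t i = 0; i < n; ++i)
        std::cout << v[i].first << " : " << v[i].second << '\n';
}

Both the counting pass and the selection are linear in the input size on average, matching the O(n)-in-file-size reading of the question.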
Upvotes: 0
Reputation: 7844
There are many different approaches to this question. It ultimately depends on the scenario and other factors, such as the size of the file: if the file has a billion lines, then a HashMap would not be an efficient way to do it. Here are some things you can do, depending on your problem:

1. Use a TreeMap, or in your case std::map.
2. Build a trie and keep the count of the various words in another data structure. This could be a heap (min/max depends on what you want to do) of size n, so you don't need to store all the words, just the necessary ones; a sketch of the heap idea follows.
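As a rough sketch of that size-n heap, here is one way it might look in C++, assuming the counts have already been gathered; the names and the sample data (borrowed from the King James Bible figures above) are illustrative only:

#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Keep a min-heap of size n: the smallest of the current top-n candidates
// sits on top, so anything that beats it pushes it out.
std::vector<std::pair<unsigned, std::string>>
topN(const std::unordered_map<std::string, unsigned>& counts, std::size_t n) {
    std::priority_queue<std::pair<unsigned, std::string>,
                        std::vector<std::pair<unsigned, std::string>>,
                        std::greater<>> heap;  // min-heap ordered by count
    for (const auto& [word, count] : counts) {
        heap.emplace(count, word);
        if (heap.size() > n)
            heap.pop();                        // evict the current smallest
    }
    std::vector<std::pair<unsigned, std::string>> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;                             // ascending by count
}

int main() {
    std::unordered_map<std::string, unsigned> counts =
        {{"the", 62600}, {"and", 37820}, {"of", 34513}, {"unto", 8912}};
    for (const auto& [count, word] : topN(counts, 3))
        std::cout << word << " : " << count << '\n';
}

The heap never holds more than n entries, so memory for the selection step stays O(n) regardless of how many distinct words there are.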
Upvotes: 2