Reputation: 5476
This question is actually quite simple, yet I would like to hear some ideas before jumping into coding. Given a file with a word on each line, find the n most frequent words.
The first, and unfortunately only, thing that pops into my mind is to use a std::map. I know fellow C++'ers will say that unordered_map would be much more reasonable.
I would like to know whether anything can be added on the algorithm side, or whether this is basically a 'whoever picks the best data structure wins' type of question. I've searched the internet and read that a hash table and a priority queue might provide an algorithm with O(n) running time; however, I assume that would be too complex to implement.
Any ideas?
Upvotes: 6
Views: 5020
Reputation: 3011
If you are just interested in the top N most frequent words, and you don't need the result to be exact, then there is a very clever structure you can use. I heard of it by way of Udi Manber; it works as follows:
You create an array of N elements, where each element tracks a value and a count, and you keep a counter that indexes into this array. Additionally, you have a map from value to index into that array. Every time you update the structure with a value (like a word from a stream of text), you first check the map to see whether that value is already in the array; if it is, you increment the count for that value. If it is not, and the element the counter points at has a count of zero, the new value claims that slot with a count of one; otherwise you decrement that element's count (evicting it from the map if the count reaches zero). In either case you then advance the counter to the next slot.
This sounds simple, and nothing about the algorithm suggests that it will yield anything useful, but for typical real data it tends to do very well. Normally, if you wish to track the top N things, you might want to build this structure with a capacity of 10*N, since there will be a lot of empty values in it. Using the King James Bible as input, here is what this structure lists as the most frequent words (in no particular order):
0 : in
1 : And
2 : shall
3 : of
4 : that
5 : to
6 : he
7 : and
8 : the
9 : I
And here are the top ten most frequent words (in order):
0 : the , 62600
1 : and , 37820
2 : of , 34513
3 : to , 13497
4 : And , 12703
5 : in , 12216
6 : that , 11699
7 : he , 9447
8 : shall , 9335
9 : unto , 8912
You can see that it got 9 of the top 10 words correct, and it did so using space for only 50 elements. Depending on your use case, the savings in space may be very useful. It is also very fast.
Here is the implementation of topN that I used, written in Go:
type Event string

type TopN struct {
	events  []Event
	counts  []int
	current int           // index of the slot considered for eviction
	mapped  map[Event]int // event -> index into events/counts
}

func makeTopN(N int) *TopN {
	return &TopN{
		counts:  make([]int, N),
		events:  make([]Event, N),
		current: 0,
		mapped:  make(map[Event]int, N),
	}
}

func (t *TopN) RegisterEvent(e Event) {
	if index, ok := t.mapped[e]; ok {
		// Already tracked: just bump its count.
		t.counts[index]++
	} else {
		if t.counts[t.current] == 0 {
			// Free slot: the new event claims it.
			t.counts[t.current] = 1
			t.events[t.current] = e
			t.mapped[e] = t.current
		} else {
			// Occupied slot: decay its count, evicting the
			// tracked event if the count reaches zero.
			t.counts[t.current]--
			if t.counts[t.current] == 0 {
				delete(t.mapped, t.events[t.current])
			}
		}
	}
	t.current = (t.current + 1) % len(t.counts)
}
Upvotes: 1
Reputation: 68688
The best data structure to use for this task is a Trie:
http://en.wikipedia.org/wiki/Trie
It will outperform a hash table for counting strings.
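To make the idea concrete, here is a minimal sketch of a trie-based word counter in C++. This is my own illustration, not code from the answer or any library, and it assumes words consist only of lowercase a-z:

#include <array>
#include <iostream>
#include <memory>
#include <string>

// One node per character; 'count' is the number of words ending here.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> children;
    unsigned count = 0;
};

// Walk the trie, creating nodes as needed, and bump the final count.
// Assumes every character is in the range 'a'..'z'.
void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c - 'a'];
        if (!child)
            child = std::make_unique<TrieNode>();
        node = child.get();
    }
    ++node->count;
}

int main() {
    TrieNode root;
    for (std::string word; std::cin >> word; )
        insert(root, word);
    // A depth-first walk over the trie would then recover each word
    // and its count, from which the top n can be selected.
}

Common prefixes are stored once, which is where the space (and cache) advantage over a hash table can come from on natural-language input.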
Upvotes: 6
Reputation: 490238
I would not start with std::map (or unordered_map) if I had much choice (though I don't know what other constraints may apply).
You have two data items here, and you use one as the key part of the time, but the other as the key another part of the time. For that, you probably want something like a Boost Bimap or possibly Boost MultiIndex.
Here's the general idea using Bimap:
#include <boost/bimap.hpp>
#include <boost/bimap/list_of.hpp>
#include <iostream>

#define elements(array) ((sizeof(array)/sizeof(array[0])))

// Wrapper around unsigned that default-constructs to zero, so a count
// stored in the bimap can be incremented in place.
class uint_proxy {
    unsigned value;
public:
    uint_proxy() : value(0) {}
    uint_proxy& operator++() { ++value; return *this; }
    unsigned operator++(int) { return value++; }
    operator unsigned() const { return value; }
};

int main() {
    int b[] = {2, 4, 3, 5, 2, 6, 6, 3, 6, 4};
    boost::bimap<int, boost::bimaps::list_of<uint_proxy> > a;

    // walk through array, counting how often each number occurs:
    for (int i = 0; i < elements(b); i++)
        ++a.left[b[i]];

    // print out the most frequent:
    std::cout << a.right.rbegin()->second;
}
For the moment, I've only printed out the most frequent number, but iterating N times to print out the N most frequent is pretty trivial.
Upvotes: 1
Reputation: 47750
Given a file with a word on each line, find the n most frequent words. ... I've searched the internet and read that a hash table and a priority queue might provide an algorithm with O(n)
If you meant that the two n's are the same, then no, this is not possible. However, if you just meant time linear in the size of the input file, then a trivial implementation with a hash table will do what you want.
There might be probabilistic approximate algorithms with sublinear memory.
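For illustration, a minimal sketch of that trivial hash-table implementation. The file name is a placeholder, and the choice of std::nth_element for the selection step (linear in the number of distinct words, on average) is my own:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::ifstream file("words.txt");  // hypothetical input, one word per line
    std::unordered_map<std::string, unsigned> counts;
    for (std::string word; std::getline(file, word); )
        ++counts[word];               // O(1) amortized per word

    // Copy the (word, count) pairs out and select the top n by count.
    std::vector<std::pair<std::string, unsigned>> v(counts.begin(), counts.end());
    std::size_t n = std::min<std::size_t>(10, v.size());
    std::nth_element(v.begin(), v.begin() + n, v.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });

    for (std::size_t i = 0; i < n; ++i)
        std::cout << v[i].first << " : " << v[i].second << '\n';
}

Both the counting pass and the selection are linear in the input size on average, matching the O(n)-in-file-size reading of the question.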
Upvotes: 0
Reputation: 7844
There are many different approaches to this question. It ultimately depends on the scenario and other factors, such as the size of the file: if the file has a billion lines, then a HashMap would not be an efficient way to do it. Here are some things you can do, depending on your problem:

1. Use a TreeMap, or in your case std::map.
2. Build a trie and keep the count of the various words in another data structure. This could be a heap (min/max depends on what you want to do) of size n, so you don't need to store all the words, just the necessary ones; a sketch of the heap idea follows.
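As a rough sketch of that size-n heap, here is one way it might look in C++, assuming the counts have already been gathered; the names and the sample data (borrowed from the King James Bible figures above) are illustrative only:

#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Keep a min-heap of size n: the smallest of the current top-n candidates
// sits on top, so anything that beats it pushes it out.
std::vector<std::pair<unsigned, std::string>>
topN(const std::unordered_map<std::string, unsigned>& counts, std::size_t n) {
    std::priority_queue<std::pair<unsigned, std::string>,
                        std::vector<std::pair<unsigned, std::string>>,
                        std::greater<>> heap;  // min-heap ordered by count
    for (const auto& [word, count] : counts) {
        heap.emplace(count, word);
        if (heap.size() > n)
            heap.pop();                        // evict the current smallest
    }
    std::vector<std::pair<unsigned, std::string>> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;                             // ascending by count
}

int main() {
    std::unordered_map<std::string, unsigned> counts =
        {{"the", 62600}, {"and", 37820}, {"of", 34513}, {"unto", 8912}};
    for (const auto& [count, word] : topN(counts, 3))
        std::cout << word << " : " << count << '\n';
}

The heap never holds more than n entries, so memory for the selection step stays O(n) regardless of how many distinct words there are.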
Upvotes: 2