Christian

Reputation: 26437

Most frequent words

What's the most efficient way in Java to get the 50 most frequent words of a text, together with their frequencies?

I want to process around 1,000,000 texts of roughly 10,000 words each, and I hope it runs in a reasonable time frame.

Upvotes: 5

Views: 6896

Answers (4)

polygenelubricants

Reputation: 384016

Most efficient would probably be using a Patricia trie that links to a max-heap. Every time you read a word, find it in the trie, go to its node in the heap, and increase its key. If the word is not in the trie yet, add it and set its key in the heap appropriately.

With a Fibonacci heap, increase-key is O(1) amortized.
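
A minimal sketch of the trie idea (Java 9+), with two simplifications: a plain (uncompressed) trie stands in for the Patricia trie, and since the JDK ships no heap with an increase-key operation, the top 50 are collected in a final pass over the trie with a bounded min-heap rather than maintained incrementally in a Fibonacci heap:

import java.util.*;

// Simplified trie counter; assumes lowercase a-z words only.
class TrieCounter {
    private static final class Node {
        final Node[] next = new Node[26]; // one slot per letter a-z
        int count;                        // occurrences of the word ending here
    }

    private final Node root = new Node();

    /** Walk/extend the path for the word and bump its terminal count. */
    void add(String word) {
        Node n = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (n.next[c] == null) n.next[c] = new Node();
            n = n.next[c];
        }
        n.count++;
    }

    /** Collect the k most frequent words via a depth-first scan. */
    List<Map.Entry<String, Integer>> top(int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        dfs(root, new StringBuilder(), heap, k);
        List<Map.Entry<String, Integer>> out = new ArrayList<>(heap);
        out.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return out;
    }

    private void dfs(Node n, StringBuilder path,
                     PriorityQueue<Map.Entry<String, Integer>> heap, int k) {
        if (n.count > 0) {
            heap.offer(Map.entry(path.toString(), n.count));
            if (heap.size() > k) heap.poll(); // evict the current minimum
        }
        for (int c = 0; c < 26; c++) {
            if (n.next[c] != null) {
                path.append((char) ('a' + c));
                dfs(n.next[c], path, heap, k);
                path.setLength(path.length() - 1);
            }
        }
    }
}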


A not-so-unreasonable solution is to use a Map<String, Integer>, incrementing the count every time a word is encountered, and then custom-sorting its entrySet() by count to get the top 50.
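
A minimal sketch of that count-then-sort variant; `texts` is a stand-in for however you iterate your documents (an Iterable<String> of whitespace-separated text):

import java.util.*;

Map<String, Integer> counts = new HashMap<>();
for (String text : texts) {
    for (String word : text.split("\\s+")) {
        if (word.isEmpty()) continue;        // skip artifacts of splitting
        counts.merge(word, 1, Integer::sum); // add 1, or insert with count 1
    }
}

// Custom-sort the entry set by count, descending, and keep the first 50.
List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
List<Map.Entry<String, Integer>> top50 = sorted.subList(0, Math.min(50, sorted.size()));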

If the O(N log N) sort is unacceptable, use a selection algorithm to find the top 50 in O(N).
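
For example, a bounded min-heap of size 50 over the `counts` map from the sketch above avoids the full sort. To be clear, this is a heap-based selection rather than a textbook O(N) quickselect; it is O(N log 50), which is effectively linear:

PriorityQueue<Map.Entry<String, Integer>> heap =
    new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
for (Map.Entry<String, Integer> e : counts.entrySet()) {
    heap.offer(e);
    if (heap.size() > 50) heap.poll(); // evict the smallest of the 51
}
List<Map.Entry<String, Integer>> top50 = new ArrayList<>(heap);
top50.sort(Map.Entry.<String, Integer>comparingByValue().reversed());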


Which technique is better really depends on what you're asking for (i.e. the comment asking whether this is more of an [algorithm] question than a [java] question is very telling).

The Map<String, Integer> followed by a selection algorithm is the most practical approach, but the Patricia trie solution clearly beats it on space efficiency alone, since common prefixes are not stored redundantly.

Upvotes: 8

tucuxi

Reputation: 17955

The following pseudocode should do the trick:

build a map<word, count>
build a tokenizer that gives you a word per iteration
for each word*,
   if word in map, increment its count
   otherwise add with count = 1
sort words by count
for each of the first 50 words,
   output word, frequency = count / total_words

This is essentially O(N) (the sort runs over distinct words, a much smaller set than the full token stream), and is what jpabluz suggested. However, if you are going to use this on any sort of "in the wild" text, you will notice lots of garbage: uppercase/lowercase variants, punctuation, URLs, stop-words such as 'the' or 'and' with very high counts, and multiple variations of the same word. The right way to do it is to lowercase all words, remove all punctuation (and things such as URLs), and add stop-word removal and stemming at the point marked with the asterisk in the pseudocode above.
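
A possible Java rendering of the pseudocode (Java 9+), with the cleanup applied at the asterisked step. The stop-word set is a tiny illustrative stand-in, stemming is omitted, and `texts` is assumed to be an Iterable<String> over the raw documents:

import java.util.*;

Set<String> stopWords = Set.of("the", "and", "a", "an", "of", "to", "in");
Map<String, Integer> counts = new HashMap<>();
long totalWords = 0;

for (String text : texts) {
    // Lowercase, then split on non-word characters to drop punctuation
    // (a real pipeline would also filter out URLs before this step).
    for (String word : text.toLowerCase().split("\\W+")) {
        if (word.isEmpty() || stopWords.contains(word)) continue; // the '*' step
        counts.merge(word, 1, Integer::sum);
        totalWords++;
    }
}

// Sort words by count and print the top 50 as relative frequencies.
List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
for (Map.Entry<String, Integer> e : sorted.subList(0, Math.min(50, sorted.size()))) {
    System.out.printf("%s %.6f%n", e.getKey(), e.getValue() / (double) totalWords);
}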

Upvotes: 4

pajton

Reputation: 16246

O(n):

  1. Count the total number of words.
  2. Split the text into a list of words.
  3. Build a map of word => number_of_occurrences.
  4. Traverse the map and select the 50 entries with the highest counts.
  5. Divide their counts by the total number of words to get frequencies.

Of course, some of these steps may be combined or may be unnecessary, depending on the data structures you use.
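
For illustration, the same steps expressed with the streams API (Java 8+); `texts` is assumed to be a Stream<String> over the documents:

import java.util.*;
import java.util.function.Function;
import java.util.stream.*;

List<String> words = texts
    .flatMap(t -> Arrays.stream(t.toLowerCase().split("\\W+")))
    .filter(w -> !w.isEmpty())
    .collect(Collectors.toList());                        // steps 1-2

long total = words.size();                                // step 1
Map<String, Long> occurrences = words.stream()            // step 3
    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

occurrences.entrySet().stream()
    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
    .limit(50)                                            // step 4
    .forEach(e -> System.out.printf("%s %.6f%n",          // step 5
        e.getKey(), e.getValue() / (double) total));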

Upvotes: 0

jpabluz

Reputation: 1312

Your best chance would be an O(n) algorithm. I would go for a text reader that splits out the words, then add each word to an ordered tree, ordered by number of appearances, with each node linked to its word. After that, just do a 50-iteration traversal to get the highest values.
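
One way to realize the ordered-tree idea in Java, assuming word counts have already been collected into a Map<String, Integer> named `counts` (as in the other answers): load the final counts into a TreeSet ordered by count descending (ties broken by word so distinct words aren't collapsed), then walk the first 50 entries. Keeping the tree ordered while counts are still changing is the tricky part, which is why the ordering step here happens after counting:

import java.util.*;

Comparator<Map.Entry<String, Integer>> byCountDesc =
    Map.Entry.<String, Integer>comparingByValue().reversed()
             .thenComparing(Map.Entry.<String, Integer>comparingByKey());
TreeSet<Map.Entry<String, Integer>> tree = new TreeSet<>(byCountDesc);
tree.addAll(counts.entrySet()); // order the finished counts in one shot

int printed = 0;
for (Map.Entry<String, Integer> e : tree) {
    if (printed++ == 50) break;
    System.out.println(e.getKey() + " " + e.getValue());
}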

Upvotes: 0
