Reputation: 26437
What's the most efficient way in Java to get the 50 most frequent words, with their frequencies, out of a text?
I want to search around ~1,000,000 texts, each of which has around ~10,000 words, and hope that it works in a reasonable time frame.
Upvotes: 5
Views: 6896
Reputation: 384016
Most efficient would probably be using a Patricia trie that links into a max-heap. Every time you read a word, find it in the trie, go to the heap, and increase-key. If it's not in the trie, add it and set its key in the heap appropriately. With a Fibonacci heap, increase-key is O(1) amortized.
A not-so-unreasonable solution is to use a Map<String, Integer>, incrementing the count every time a word is encountered, and then custom-sorting its entrySet() based on the count to get the top 50. If the O(N log N) sort is unacceptable, use a selection algorithm to find the top 50 in O(N).
Which technique is better really depends on what you're asking for (i.e. the comment about whether this is more of an [algorithm] question than a [java] question is very telling). The Map<String, Integer> followed by a selection algorithm is most practical, but the Patricia trie solution clearly beats it in space efficiency alone (since common prefixes are not stored redundantly).
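The practical approach can be sketched as follows. The JDK has no linear-time selection built in, so this sketch substitutes a bounded min-heap of size 50 (O(N log 50), effectively linear for a fixed k); the class and method names are illustrative, not from the original answer:

```java
import java.util.*;

public class TopWords {
    // Returns the k most frequent words, most frequent first.
    public static List<Map.Entry<String, Integer>> topK(List<String> words, int k) {
        // Count occurrences in a single pass.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        // Keep a min-heap of the k best entries seen so far;
        // the smallest of the k is evicted whenever a better one arrives.
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // evict the current minimum
            }
        }
        // Drain the heap and order descending so the most frequent comes first.
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "a", "b");
        System.out.println(topK(words, 2)); // prints [a=3, b=2]
    }
}
```

The final sort is only over k entries, so the overall cost stays dominated by the counting pass.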
Upvotes: 8
Reputation: 17955
The following pseudocode should do the trick:
build a map<word, count>
build a tokenizer that gives you a word per iteration
for each word*,
    if word in map, increment its count
    otherwise add with count = 1
sort words by count
for each of the first 50 words,
    output word, frequency = count / total_words
This is essentially O(N) in the number of words (the sort is over distinct words only), and is what jpabluz suggested. However, if you are going to use this on any sort of "in the wild" text, you will notice lots of garbage: uppercase/lowercase, punctuation, URLs, stop-words such as 'the' or 'and' with very high counts, multiple variations of the same word... The right way to do it is to lowercase all words, remove all punctuation (and things such as URLs), and add stop-word removal and stemming at the point marked with the asterisk in the above pseudocode.
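The cleanup step could look roughly like this in Java. The stop-word list and URL pattern below are illustrative placeholders (a real list would be much longer), and stemming is omitted since the JDK has no built-in stemmer:

```java
import java.util.*;
import java.util.regex.Pattern;

public class Tokenizer {
    // Illustrative stop-word list; a real one would be much longer.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "and", "a", "of", "to"));
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // Lowercases, strips URLs and punctuation, and drops stop words.
    public static List<String> tokenize(String text) {
        String cleaned = URL.matcher(text.toLowerCase()).replaceAll(" ");
        List<String> words = new ArrayList<>();
        // Split on any run of non-letter, non-digit characters.
        for (String token : cleaned.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The cat, the CAT and http://x.y/z a dog!"));
        // prints [cat, cat, dog]
    }
}
```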
Upvotes: 4
Reputation: 16246
O(n):
Of course, some of these steps may be done at the same time or may be unnecessary, depending on the data structures you use.
Upvotes: 0
Reputation: 1312
Your best chance would be an O(n) algorithm. I would go for a text reader that splits out the words, then add each word to an ordered tree, ordered by number of appearances, with each node linked to its word. After that, just do a 50-iteration traversal to get the highest values.
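One way to sketch the ordered-tree idea in Java: since re-keying a tree node every time a count changes is awkward, this sketch (an assumption, not the answer's exact design) counts first with a HashMap and then builds a TreeMap keyed by count, highest first, before traversing it:

```java
import java.util.*;

public class TopByTree {
    // Counts words, orders the counts in a descending TreeMap, and
    // takes entries until up to 50 words have been emitted.
    public static List<String> top50(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        // Ordered tree keyed by count, highest first; each count maps
        // to all words that occur that many times.
        TreeMap<Integer, List<String>> byCount =
            new TreeMap<>(Comparator.reverseOrder());
        counts.forEach((word, count) ->
            byCount.computeIfAbsent(count, c -> new ArrayList<>()).add(word));

        List<String> result = new ArrayList<>();
        outer:
        for (List<String> bucket : byCount.values()) {
            for (String w : bucket) {
                result.add(w);
                if (result.size() == 50) break outer;
            }
        }
        return result;
    }
}
```

Ties (words with the same count) come out in arbitrary order within a bucket; a secondary comparator could be added if a stable order matters.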
Upvotes: 0