Reputation: 1176
I need to write a program that counts the frequency of each word in a text; additionally, I need to be able to return a list of the n most frequent words (if several words have the same frequency, they are sorted alphabetically). There is also a list of words that are not counted (stop words).
What approach overall is more efficient?
P.S. @Moderators I know there is a similar question, but I have different constraints that require a different structure.
Upvotes: 0
Views: 2154
Reputation: 55609
Let's assume there are k words in total, m distinct words, and you want the n most frequent words.
TreeMap
Since there can never be more than m words in the map, each update / insert will cost O(log m), giving a total running time of O(k log m).
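A minimal counting sketch with a `TreeMap` (names here are illustrative, not from the original post); `Map.merge` does the insert-or-increment in one call, each costing O(log m):

```java
import java.util.Map;
import java.util.TreeMap;

public class TreeMapCount {
    // Count word frequencies; each merge() is an O(log m) tree operation.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return freq;
    }
}
```

A side benefit of the `TreeMap` is that iterating it already yields the words in alphabetical order.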
HashMap
Each update / insert will cost expected O(1), taking O(k) for all words. Then, since there will be m words in the map, sorting will take O(m log m).
But we can do better than sorting: we can iterate through the HashMap and maintain a heap (PriorityQueue) of the n most frequent words, sorted primarily by frequency and secondarily alphabetically. After each insert into the heap, if its size is greater than n, we remove the least frequent word. This will take O(m log n).
So, the total running time would be expected O(k + m log n).
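The HashMap-plus-heap approach above could be sketched like this (class and method names are my own). The heap's head is the current weakest candidate: lowest frequency first and, among equal frequencies, the alphabetically later word, so that alphabetically earlier words survive ties:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopN {
    // Expected O(k) counting with a HashMap, then an O(m log n) heap pass.
    static List<String> topN(String[] words, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) freq.merge(w, 1, Integer::sum);

        // Min-heap of size <= n: lowest frequency at the head; among equal
        // frequencies, the alphabetically larger word is evicted first.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
            Comparator.<Map.Entry<String, Integer>>comparingInt(Map.Entry::getValue)
                      .thenComparing(Map.Entry::getKey, Comparator.reverseOrder()));

        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            heap.offer(e);                        // O(log n)
            if (heap.size() > n) heap.poll();     // drop the weakest candidate
        }

        // Unload the heap weakest-first; prepend so the strongest ends up first.
        LinkedList<String> result = new LinkedList<>();
        while (!heap.isEmpty()) result.addFirst(heap.poll().getKey());
        return result;
    }
}
```

Note the inverted secondary comparator: since the heap evicts its smallest element, "smallest" must mean "least deserving of the top-n list".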
Comparison
Since n <= m and m <= k, we know that m log n <= k log m; and, assuming there are plenty of duplicates or n is somewhat smaller than m, k + m log n <= k log m, so the HashMap is generally the preferred option.
Upvotes: 2
Reputation: 11114
You can perform a single pass over the final HashMap<String, Integer> of word counts and, using a TreeMap<Integer, List<String>> together with the TreeMap.firstKey() and TreeMap.size() methods, compute the top N words. Exercise left to the reader. Alternatively (or better), use a red-black tree like this one, continually pruning the lowest-count node as you find larger counts (once the size of the tree exceeds N).
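One way the "exercise" above might look, under my own naming, using a TreeMap keyed by count: each word goes into the bucket for its count, and the entire lowest-count bucket is pruned whenever the remaining words still cover the top n (the TreeMap's first entry is its lowest key):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TopNTree {
    // One pass over the precomputed counts, assuming n >= 1.
    static List<String> topN(Map<String, Integer> freq, int n) {
        TreeMap<Integer, List<String>> byCount = new TreeMap<>();
        int stored = 0;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), c -> new ArrayList<>())
                   .add(e.getKey());
            stored++;
            // Prune whole lowest-count buckets while at least n words remain.
            List<String> lowest = byCount.firstEntry().getValue();
            while (stored - lowest.size() >= n) {
                stored -= lowest.size();
                byCount.pollFirstEntry();
                lowest = byCount.firstEntry().getValue();
            }
        }
        // Collect from highest to lowest count, breaking ties alphabetically.
        List<String> result = new ArrayList<>();
        for (List<String> bucket : byCount.descendingMap().values()) {
            Collections.sort(bucket);
            for (String w : bucket) {
                if (result.size() == n) return result;
                result.add(w);
            }
        }
        return result;
    }
}
```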
Upvotes: 0
Reputation: 1768
Stop words: a HashSet or a regexp.
I'd go with a hash map for the frequency counts:
You'll be doing lots of insertions. Overall insertion cost for a HashMap: O(n); for a TreeMap: O(n log w) (here n is the total number of words and w the number of distinct words).
Sorting cost for the hash map: O(w log w), plus O(w) extraction overhead. Even though this cost is zero for a TreeMap, the O(w log w) term becomes negligible relative to the O(n) counting cost for large n.
Note that, in general, there is no guaranteed way to settle this without benchmarking both.
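The HashSet option for stop words could be wired in like this (a sketch with illustrative names, not code from the post); `contains` on a `HashSet` is an expected O(1) lookup:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StopWords {
    // Skip stop words during counting so they never enter the frequency map.
    static Map<String, Integer> count(String[] words, Set<String> stopWords) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            if (!stopWords.contains(w)) {       // expected O(1) per lookup
                freq.merge(w, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```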
Upvotes: 0