Reputation: 1176
I need to write a program that counts the frequency of each word in a text; additionally, I need to be able to return a list of the n most frequent words (if several words have the same frequency, they are sorted alphabetically). There is also a list of words that are not counted (stop words).
What approach overall is more efficient?
P.S. @Moderators I know there is a similar question, but I have different constraints that require a different structure.
Upvotes: 0
Views: 2154
Reputation: 55609
Let's assume there are k words in total, m distinct words, and you want the n most frequent words.
TreeMap
Since there can never be more than m words in the map, each update / insert will cost O(log m), giving a total running time of O(k log m).
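A minimal counting sketch with a `TreeMap` (names here are illustrative, not from the original post); `Map.merge` does the insert-or-increment in one call, each costing O(log m):

```java
import java.util.Map;
import java.util.TreeMap;

public class TreeMapCount {
    // Count word frequencies; each merge() is an O(log m) tree operation.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return freq;
    }
}
```

A side benefit of the `TreeMap` is that iterating it already yields the words in alphabetical order.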
HashMap
Each update / insert will cost expected O(1), taking O(k) for all words. Then, since there will be m words in the map, sorting will take O(m log m).
But we can do better than sorting: we can iterate through the HashMap and maintain a heap (PriorityQueue) of the n most frequent words, sorted primarily by frequency and secondarily alphabetically. After each insert into the heap, if its size is greater than n, we remove the least frequent word. This will take O(m log n).
So, the total running time would be expected O(k + m log n).
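The HashMap-plus-heap approach above could be sketched like this (class and method names are my own). The heap's head is the current weakest candidate: lowest frequency first and, among equal frequencies, the alphabetically later word, so that alphabetically earlier words survive ties:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopN {
    // Expected O(k) counting with a HashMap, then an O(m log n) heap pass.
    static List<String> topN(String[] words, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) freq.merge(w, 1, Integer::sum);

        // Min-heap of size <= n: lowest frequency at the head; among equal
        // frequencies, the alphabetically larger word is evicted first.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
            Comparator.<Map.Entry<String, Integer>>comparingInt(Map.Entry::getValue)
                      .thenComparing(Map.Entry::getKey, Comparator.reverseOrder()));

        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            heap.offer(e);                        // O(log n)
            if (heap.size() > n) heap.poll();     // drop the weakest candidate
        }

        // Unload the heap weakest-first; prepend so the strongest ends up first.
        LinkedList<String> result = new LinkedList<>();
        while (!heap.isEmpty()) result.addFirst(heap.poll().getKey());
        return result;
    }
}
```

Note the inverted secondary comparator: since the heap evicts its smallest element, "smallest" must mean "least deserving of the top-n list".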
Comparison
Since n <= m and m <= k, we know that m log n <= k log m; and, assuming there are plenty of duplicates or n is somewhat smaller than m, k + m log n <= k log m, so the HashMap is generally the preferred option.
Upvotes: 2
Reputation: 11114
You can perform a single pass over the final HashMap<String, Integer> of word counts and, using a TreeMap<Integer, List<String>> together with the TreeMap.firstKey() and TreeMap.size() methods, compute the top N words. Exercise left to the reader. Alternatively (or better), use a red-black tree like this one, continually pruning the lowest-count node as you find larger counts (once the size of the tree exceeds N).
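One way the "exercise" above might look, under my own naming, using a TreeMap keyed by count: each word goes into the bucket for its count, and the entire lowest-count bucket is pruned whenever the remaining words still cover the top n (the TreeMap's first entry is its lowest key):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TopNTree {
    // One pass over the precomputed counts, assuming n >= 1.
    static List<String> topN(Map<String, Integer> freq, int n) {
        TreeMap<Integer, List<String>> byCount = new TreeMap<>();
        int stored = 0;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), c -> new ArrayList<>())
                   .add(e.getKey());
            stored++;
            // Prune whole lowest-count buckets while at least n words remain.
            List<String> lowest = byCount.firstEntry().getValue();
            while (stored - lowest.size() >= n) {
                stored -= lowest.size();
                byCount.pollFirstEntry();
                lowest = byCount.firstEntry().getValue();
            }
        }
        // Collect from highest to lowest count, breaking ties alphabetically.
        List<String> result = new ArrayList<>();
        for (List<String> bucket : byCount.descendingMap().values()) {
            Collections.sort(bucket);
            for (String w : bucket) {
                if (result.size() == n) return result;
                result.add(w);
            }
        }
        return result;
    }
}
```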
Upvotes: 0
Reputation: 1768
Stop words: a HashSet or a regexp.
I'd go with a hash map for the frequency counts:
You'll be doing lots of insertions. Overall insertion cost for a HashMap: O(n); for a TreeMap: O(n log w) (here n is the total number of words and w the number of distinct words).
Sorting cost for the hash map: O(w log w), plus O(w) extraction overhead. Even though this cost is zero for a TreeMap, the O(w log w) term becomes negligible relative to the O(n) counting cost for large n.
Note that, in general, there is no guaranteed way to settle this without benchmarking both.
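The HashSet option for stop words could be wired in like this (a sketch with illustrative names, not code from the post); `contains` on a `HashSet` is an expected O(1) lookup:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StopWords {
    // Skip stop words during counting so they never enter the frequency map.
    static Map<String, Integer> count(String[] words, Set<String> stopWords) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            if (!stopWords.contains(w)) {       // expected O(1) per lookup
                freq.merge(w, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```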
Upvotes: 0