Reputation: 5476
Question: Which data structure is more efficient when calculating the n most frequent words in a text file: Hash tables or Priority Queues?
I've previously asked a question related to this subject, but after the creative responses I got confused, and I've decided on two data structures that I can actually implement easily: hash tables vs. priority queues.
Priority Queue Confusion: To be honest, I've watched a lecture on YouTube about priority queues and understood every component of it, but when it comes to applicability I get confused. Using a binary heap I can easily implement a priority queue; my challenge is matching its components' usage to the frequency problem.
My Hash Table Idea: Since deciding on the hash table's size was a bit uncertain, I went with what made the most sense to me: 26, for the number of letters in the alphabet. In addition, with a good hash function it would be efficient. However, reaching into the linked lists over and over again (using separate chaining for collisions) and incrementing each word's integer value by 1 wouldn't, in my opinion, be efficient.
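Roughly what I have in mind, as an untested sketch (it assumes non-empty, alphabetic words):

    #include <cctype>
    #include <list>
    #include <string>
    #include <utility>

    // 26 buckets, one per first letter; each bucket is a chain of (word, count) pairs.
    struct LetterTable {
        std::list<std::pair<std::string, int>> buckets[26];

        void add(const std::string& word) {
            auto& chain = buckets[std::tolower(static_cast<unsigned char>(word[0])) - 'a'];
            for (auto& entry : chain)               // walk the chain again and again...
                if (entry.first == word) { ++entry.second; return; }
            chain.emplace_back(word, 1);
        }
    };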
Sorry for the long post, but as fellow programmers, which one would you recommend? If the priority queue, can you simply give me ideas on how to relate it to my question? If the hash table, could anything be done to make it even more efficient?
Upvotes: 1
Views: 633
Reputation: 2491
A hash table would be the faster of the two choices offered, besides making more sense. Rather than choosing a size of 26, if you have an estimate of the total number of unique words (most people's vocabularies, outside of specialized technical terms, are not much bigger than 10,000; 20,000 is really big, and 30,000 is for people who make a hobby of collecting words), make the size big enough that you don't expect to ever fill it, so the probability of a collision stays low - not more than 25%. If you want to be more conservative, implement a function to rehash the contents of the table into a table of twice the original size (and make the new size a prime, so only approximately twice the original size).
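As a rough sketch (reading the 25% figure as a load-factor bound; the names and layout here are just one way to do it), you could pick a prime bucket count from a vocabulary estimate and rehash into a roughly doubled prime size whenever the table gets too full:

    #include <functional>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    // Smallest prime >= n, by trial division (cheap at these sizes).
    static std::size_t next_prime(std::size_t n) {
        auto is_prime = [](std::size_t x) {
            if (x < 2) return false;
            for (std::size_t d = 2; d * d <= x; ++d)
                if (x % d == 0) return false;
            return true;
        };
        while (!is_prime(n)) ++n;
        return n;
    }

    // Separately chained word-count table, rehashed whenever the load factor
    // would exceed 25%.
    class WordCounter {
        std::vector<std::list<std::pair<std::string, int>>> buckets_;
        std::size_t size_ = 0;

    public:
        // e.g. WordCounter wc(30000); for a generous vocabulary estimate
        explicit WordCounter(std::size_t expected_words)
            : buckets_(next_prime(4 * expected_words + 1)) {}

        void add(const std::string& word) {
            if (4 * (size_ + 1) > buckets_.size())          // keep load <= 25%
                rehash(next_prime(2 * buckets_.size()));
            auto& chain = buckets_[std::hash<std::string>{}(word) % buckets_.size()];
            for (auto& entry : chain)
                if (entry.first == word) { ++entry.second; return; }
            chain.emplace_back(word, 1);
            ++size_;
        }

    private:
        void rehash(std::size_t new_bucket_count) {
            std::vector<std::list<std::pair<std::string, int>>> fresh(new_bucket_count);
            for (auto& chain : buckets_)
                for (auto& entry : chain)
                    fresh[std::hash<std::string>{}(entry.first) % new_bucket_count]
                        .push_back(std::move(entry));
            buckets_.swap(fresh);
        }
    };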
Now since this is tagged C++, you might ask yourself why you aren't just using a multiset straight out of the standard template library. It will keep a count of how many of each word you enter into it.
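For instance, a minimal sketch, assuming whitespace-delimited words read from standard input:

    #include <iostream>
    #include <set>
    #include <string>

    int main() {
        std::multiset<std::string> words;
        std::string w;
        while (std::cin >> w)          // each whitespace-delimited token is a "word"
            words.insert(w);

        // count() gives the frequency of any particular word
        std::cout << "\"the\" appeared " << words.count("the") << " times\n";
    }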
In either case you'll need to make a separate pass to find which of the words are the n most frequent, as you only have the frequencies, not the rank order of the frequencies.
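That second pass is also where a priority queue fits naturally: keep a min-heap of the best n (count, word) pairs seen so far and evict the weakest as you scan the counts. A sketch, assuming the counts ended up in a std::unordered_map<std::string, int> (the hypothetical freq below):

    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Return the n most frequent words, most frequent first.
    std::vector<std::pair<std::string, int>>
    top_n(const std::unordered_map<std::string, int>& freq, std::size_t n) {
        using Entry = std::pair<int, std::string>;   // (count, word)
        // Min-heap: the root is always the weakest of the current top n.
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

        for (const auto& kv : freq) {
            heap.emplace(kv.second, kv.first);
            if (heap.size() > n)
                heap.pop();                          // evict the least frequent
        }

        std::vector<std::pair<std::string, int>> result;
        while (!heap.empty()) {
            result.emplace_back(heap.top().second, heap.top().first);
            heap.pop();
        }
        std::reverse(result.begin(), result.end()); // most frequent first
        return result;
    }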
Upvotes: 1
Reputation: 171206
Why don't you use a generic/universal string hashing function? After all, you don't want to count by the first letter; you want to count over all possible words. I'd keep the bucket count dynamic; if not, you will need to do insane amounts of linked-list traversal.
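That is roughly what std::unordered_map already gives you: std::hash<std::string> hashes the whole word, not just its first letter, and the bucket count grows automatically as the table fills. A minimal sketch:

    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> freq;   // general-purpose string hash, dynamic buckets
        std::string w;
        while (std::cin >> w)
            ++freq[w];

        std::cout << freq.size() << " distinct words across "
                  << freq.bucket_count() << " buckets\n";
    }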
Upvotes: 0