Reputation: 639
Let's say you are given a huge file, say 1GB. The file contains a word on each line (total n words), and you want to find the k most frequent terms in the file.
Now, assuming you have enough memory to store these words, what is the better way to approach the question in terms of reducing memory usage and the constant overhead hidden in the Big-O complexity? I believe there are two basic algorithms one can use: count the words with a hash table, or count them with a trie, and then pick out the k most frequent.
Which is a better approach?
Also: if you didn't have enough memory for a hash table/trie (i.e. limited memory of 10MB or so), then what is the best approach?
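For concreteness, here is a minimal sketch of the first setting (everything fits in memory), using the hash-table variant in Python; the trie variant would differ only in the counting structure. The file name is hypothetical:

```python
from collections import Counter
import heapq

def top_k_words(path, k):
    # Count every word with a hash table (dict); the file has one word per line.
    counts = Counter()
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word:
                counts[word] += 1
    # Pull out the k most frequent entries with a size-k heap: O(n log k).
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])

# Hypothetical usage:
# print(top_k_words("words.txt", 10))
```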
Upvotes: 13
Views: 3050
Reputation: 178521
Which is more efficient regarding the constants depends a lot on the specifics. On one hand, a trie offers strict O(N) time complexity for inserting all elements, while a hash table might decay to quadratic time in the worst case. On the other hand, tries are not very cache-efficient: each lookup requires O(|S|) random memory accesses, which might cause performance to degrade significantly.
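To make the O(|S|) cost concrete, here is a bare-bones trie counter (a sketch, not a tuned implementation). Each insert walks one node per character, and each hop from node to node is effectively a random memory access:

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> child TrieNode
        self.count = 0      # times the word ending at this node was seen

def trie_insert(root, word):
    # One node hop per character: O(|S|) work per word.
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        node = node.children[ch]
    node.count += 1
```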
Both approaches are valid, and I think there are multiple considerations to weigh when choosing one over the other, such as maximum latency (if it is a real-time system), throughput, and time to develop.
If average-case performance is all that matters, I'd suggest generating a bunch of files and running a statistical analysis of which approach is better. The Wilcoxon signed-rank test is the de facto standard hypothesis test for this.
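A sketch of that comparison, assuming you already have two top-k implementations and a list of generated test files to time them on (the helper names are made up; the test itself is `scipy.stats.wilcoxon`):

```python
import time
from scipy.stats import wilcoxon

def run_time(top_k_fn, path):
    # Wall-clock time of one run of a top-k implementation on one file.
    start = time.perf_counter()
    top_k_fn(path)
    return time.perf_counter() - start

def compare(trie_top_k, hash_top_k, files):
    # Paired samples: both implementations are timed on the same generated files.
    trie_times = [run_time(trie_top_k, f) for f in files]
    hash_times = [run_time(hash_top_k, f) for f in files]
    statistic, p_value = wilcoxon(trie_times, hash_times)
    return statistic, p_value
```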
Regarding embedded systems (the limited-memory case): both approaches are still valid, but here each "node" (or batch of nodes) of the trie will live on disk rather than in RAM. Note that this means O(|S|) random-access disk seeks per entry for the trie, which might be quite slow.
For the hashing solution: you have 10MB; let's say 5MB of it can be used for a hash table of pointers to disk. Let's also assume you can store 500 different disk addresses in those 5MB (a pessimistic analysis here). That leaves 5MB to load a bucket into after each hash lookup. With 500 buckets and a load factor of 0.5, you can store 500 * 5MB * 0.5 ≈ 1.25GB > 1GB of your data. So with the hash table solution, each lookup needs only O(1) random disk seeks to find the bucket containing the relevant string.
Note that if this still is not enough, we can rehash the pointer tables, very similar to what is done with page tables in the virtual memory mechanism.
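A rough sketch of that layout, under the assumptions above (500 buckets, one pickle file per bucket in a hypothetical `buckets/` directory; only the bucket index lives in RAM, and one bucket is loaded per lookup). A real implementation would batch updates instead of rewriting a bucket on every word:

```python
import os
import pickle
import zlib

NUM_BUCKETS = 500       # taken from the pessimistic analysis above
BUCKET_DIR = "buckets"  # hypothetical on-disk directory, one file per bucket
os.makedirs(BUCKET_DIR, exist_ok=True)

def bucket_path(word):
    # The in-RAM "pointer table" is just a stable hash(word) -> bucket-file mapping.
    bucket = zlib.crc32(word.encode()) % NUM_BUCKETS
    return os.path.join(BUCKET_DIR, f"{bucket}.pkl")

def add_word(word):
    # O(1) disk accesses per word: load one bucket, update it, write it back.
    path = bucket_path(word)
    counts = {}
    if os.path.exists(path):
        with open(path, "rb") as f:
            counts = pickle.load(f)
    counts[word] = counts.get(word, 0) + 1
    with open(path, "wb") as f:
        pickle.dump(counts, f)
```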
From this we can conclude that, for embedded systems, the hash solution is better in most cases (note it might still suffer from high latency in the worst case; there is no silver bullet here).
PS: a radix tree is usually faster and more compact than a plain trie, but it suffers from the same drawbacks as a trie compared to hash tables (though to a lesser degree, of course).
Upvotes: 5
Reputation: 187
Do you have a drive available to store intermediate results? If so:
You can keep a meta structure and a set of hash tables. Read a portion of the data and fill a hash table while its size stays below 3MB; once it exceeds 3MB, save it to disk. (If your limit is 10MB, a hash table size of 3MB is one example split.)
The meta structure describes your hash tables: for each one it can store the number of unique words, the total count of all words in that table, and the maximum count of any single word.
After this, you can load the hash tables back from disk and merge them. For example, you might load them in ascending order of unique words, or of the maximum single-word count; at this step you can apply some heuristics.
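A sketch of that spill-and-merge idea, with the 3MB limit approximated by a cap on unique words per chunk (the cap and file names are made up). The merge here simply combines everything into one Counter, which only works if the merged result fits in memory; the ordering heuristics described above are left out:

```python
import pickle
from collections import Counter

def save_chunk(counts, index):
    name = f"chunk_{index}.pkl"  # hypothetical file name
    with open(name, "wb") as f:
        pickle.dump(counts, f)
    return name

def spill_chunks(path, max_unique_words=100_000):
    # Fill a hash table until it hits the size limit, then save it to disk.
    chunk_files, counts = [], Counter()
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word:
                counts[word] += 1
            if len(counts) >= max_unique_words:
                chunk_files.append(save_chunk(counts, len(chunk_files)))
                counts = Counter()
    if counts:
        chunk_files.append(save_chunk(counts, len(chunk_files)))
    return chunk_files

def merge_chunks(chunk_files):
    # Load the saved hash tables back from disk and merge their counts.
    total = Counter()
    for name in chunk_files:
        with open(name, "rb") as f:
            total.update(pickle.load(f))
    return total
```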
Upvotes: 0
Reputation: 6021
For the limited-memory option you could quicksort the list first, then simply populate a hash table with k items in it. You would then need one more counter to track how many occurrences the current word you are checking has; if that count is higher, you replace the lowest item in the hash table with the current word.
This would probably work okay for the initial list, but it would be slower than just scanning the full list and populating a hash table with the counts.
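A sketch of that sort-then-scan idea, using a size-k min-heap in place of the "hash table with k items" (assumes the words arrive already sorted, so repeats are adjacent and the single counter works):

```python
import heapq
import itertools

def top_k_from_sorted(sorted_words, k):
    # sorted_words: words in sorted order, so repeats of a word are adjacent.
    heap = []  # min-heap of (count, word), capped at k entries
    for word, run in itertools.groupby(sorted_words):
        count = sum(1 for _ in run)  # the "one more counter" for the current word
        if len(heap) < k:
            heapq.heappush(heap, (count, word))
        elif count > heap[0][0]:
            # The current word beats the lowest item; replace it.
            heapq.heapreplace(heap, (count, word))
    return sorted(heap, reverse=True)
```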
Upvotes: 0