Reputation: 115
If we have a huge amount of string data in a file, we can normally use an algorithm, say (hash + heap) or (trie + heap), etc., to efficiently find the top 'k' words with the highest frequency. How do I do this if the huge amount of string data lives in my 'database'? Right now the only way I know is to query the entire data set and then run the frequency computation on it, but querying the whole data set is a very costly operation. Is there a more efficient way to do this?
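For a file that fits in memory, the (hash + heap) approach I mean is roughly this sketch in Python (assuming words is an iterable of tokens read from the file):

import heapq
from collections import Counter

def top_k_words(words, k):
    # Hash map (Counter) accumulates word -> frequency in a single pass.
    counts = Counter(words)
    # A heap then selects the k most frequent entries in O(n log k).
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])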
Upvotes: 1
Views: 349
Reputation: 178431
Finding information in huge data sets is usually done by parallelizing the work across a cluster rather than using a single machine.
What you are describing is a classic map-reduce problem, which can be handled with the following functions (in pseudocode):
map(doc):
    for each word in doc:
        emitIntermediate(word, "1")

reduce(word, list<counts>):
    emit(word, size(list))
The map-reduce framework, which has implementations in many languages, lets you scale the problem to a large cluster without much effort, taking care of failures and worker management for you.
Here, doc is a single document; the framework usually assumes a collection of documents. If you have only one huge document, you can of course split it into smaller documents and invoke the same algorithm.
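As a rough single-process illustration of the same map and reduce steps (plus a heap for the top k), here is a minimal Python sketch; the function names and the idea of feeding it batches of database rows are assumptions for illustration, not any particular framework's API:

from collections import defaultdict
import heapq

def map_doc(doc):
    # map: emit an intermediate (word, 1) pair for every word in the document.
    for word in doc.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # The framework groups intermediate pairs by key between map and reduce.
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups.items()

def reduce_word(word, counts):
    # reduce: the number of emitted 1s is the word's total frequency.
    return (word, len(counts))

def top_k(docs, k):
    mapped = (pair for doc in docs for pair in map_doc(doc))
    reduced = (reduce_word(word, counts) for word, counts in shuffle(mapped))
    # A heap keeps only the k most frequent words from the reduced output.
    return heapq.nlargest(k, reduced, key=lambda pair: pair[1])

# Each "document" could be, for example, one batch of rows fetched from the database.
print(top_k(["to be or not to be", "to do is to be"], 2))  # [('to', 4), ('be', 3)]

A real framework such as Hadoop runs the map and reduce calls on many machines in parallel and handles the shuffle step for you; only the two functions themselves have to be written.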
Upvotes: 2