Reputation: 115
If we have a huge amount of string data in a file, we can normally use an algorithm, say (hash + heap) or (trie + heap), etc., to efficiently find the top 'k' words with the highest frequency. How do I do this if the huge amount of string data lives in my 'database'? Right now the only way I know is to query the entire data set and then run the frequency computation on it, but querying the whole data set is a very costly operation. Is there a more efficient way to do this?
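For a file that fits in memory, the (hash + heap) approach I mean is roughly this sketch in Python (assuming words is an iterable of tokens read from the file):

import heapq
from collections import Counter

def top_k_words(words, k):
    # Hash map (Counter) accumulates word -> frequency in a single pass.
    counts = Counter(words)
    # A heap then selects the k most frequent entries in O(n log k).
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])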
Upvotes: 1
Views: 349
Reputation: 178431
Finding information in huge data sets is usually done by parallelizing the work across a cluster rather than using a single machine.
What you are describing is a classic map-reduce problem, which can be handled with the following functions (in pseudocode):
map(doc):
    for each word in doc:
        emitIntermediate(word, "1")

reduce(word, list<counts>):
    emit(word, size(list))
The map-reduce framework, which has implementations in many languages, lets you scale the problem to a large cluster without much effort, taking care of failures and worker management for you.
Here, doc is a single document; the framework usually assumes a collection of documents. If you have only one huge document, you can of course split it into smaller documents and invoke the same algorithm.
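As a rough single-process illustration of the same map and reduce steps (plus a heap for the top k), here is a minimal Python sketch; the function names and the idea of feeding it batches of database rows are assumptions for illustration, not any particular framework's API:

from collections import defaultdict
import heapq

def map_doc(doc):
    # map: emit an intermediate (word, 1) pair for every word in the document.
    for word in doc.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # The framework groups intermediate pairs by key between map and reduce.
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups.items()

def reduce_word(word, counts):
    # reduce: the number of emitted 1s is the word's total frequency.
    return (word, len(counts))

def top_k(docs, k):
    mapped = (pair for doc in docs for pair in map_doc(doc))
    reduced = (reduce_word(word, counts) for word, counts in shuffle(mapped))
    # A heap keeps only the k most frequent words from the reduced output.
    return heapq.nlargest(k, reduced, key=lambda pair: pair[1])

# Each "document" could be, for example, one batch of rows fetched from the database.
print(top_k(["to be or not to be", "to do is to be"], 2))  # [('to', 4), ('be', 3)]

A real framework such as Hadoop runs the map and reduce calls on many machines in parallel and handles the shuffle step for you; only the two functions themselves have to be written.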
Upvotes: 2