Aman Singh

Reputation: 115

Finding word frequencies in huge data in a database

If we have huge string data in a file, we can normally use an algorithm such as (hash + heap) or (trie + heap) to efficiently find the top 'k' most frequent words. How do I do this when the huge amount of string data is in my 'database'? Right now the only way I know is to query the entire data set and then run the frequency computation on it, but querying the whole data set is a very costly operation. Is there a more efficient way to do this?
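
For context, this is roughly what I mean by the (hash + heap) approach, sketched in Python (collections.Counter as the hash, heapq for the heap); the database case is the part I'm stuck on:

import heapq
from collections import Counter

def top_k_words(text, k):
    # hash: count the frequency of every word
    counts = Counter(text.split())
    # heap: pick the k entries with the highest counts
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

print(top_k_words("a b b c c c", 2))   # [('c', 3), ('b', 2)]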

Upvotes: 1

Views: 349

Answers (1)

amit

Reputation: 178431

Finding information in huge data sets is done by parallelizing the work and using a cluster rather than a single machine.

What you are describing is a classic map-reduce problem, which can be handled using the following functions (in pseudo code):

map(doc):
  for each word in doc:
      emitIntermediate(word, "1")

reduce(word, list):
  emit(word, size(list))

The map-reduce framework, which is implemented in many languages, allows you to easily scale the problem and use a huge cluster without much effort, taking care of failures and worker management for you.

Here, doc is a single document; the framework usually assumes a collection of documents. If you have only one huge document, you can of course split it into smaller documents and invoke the same algorithm.
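
As a single-machine illustration only (a real framework distributes these phases across the cluster), the two functions above can be simulated in Python like this; the names map_phase, reduce_phase and word_count are just for the sketch:

from collections import defaultdict

def map_phase(doc):
    # emitIntermediate(word, "1") for every word in the document
    return [(word, 1) for word in doc.split()]

def reduce_phase(word, values):
    # emit(word, size(list))
    return (word, len(values))

def word_count(docs):
    grouped = defaultdict(list)
    # shuffle step: group the intermediate values by word
    for doc in docs:
        for word, one in map_phase(doc):
            grouped[word].append(one)
    return [reduce_phase(word, values) for word, values in grouped.items()]

print(word_count(["a b b", "b c"]))   # [('a', 1), ('b', 3), ('c', 1)]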

Upvotes: 2
