SystematicFrank
SystematicFrank

Reputation: 17279

Text search in NoSQL with mapreduce

I am working on an app that requires searching a large list of titles. Ideally I would like to use NoSQL but it seems that text search across the whole database is not as good as in SQL databases (please correct me if I am wrong)

In any case, I do want to optimize the speed of searches. A normal search might be fast enough, but I do want a responsive live-search AND fuzzy search. Therefore I can only think of two approches:

  1. Load the whole list of titles in memory and indexed as a trie or prefix tree

  2. Implement some type of trie algorithm with a mapreduce function. This would be preffered solution, but I am not sure if it can be done or the disk space cost might outweigh the benefits.

any ideas? Also I am not sure if a "fuzzy search" is the best implemented with a trie or with a B+ tree.

Since the "titles" are unique. Should I just use the full title as the ID?

Upvotes: 3

Views: 978

Answers (1)

Aurélien Bénel
Aurélien Bénel

Reputation: 3852

To do this efficiently, you'll have to index your text by words.

In other words, the object foo entitled MapReduce: Simplified Data Processing on Large Clusters will be mapped to the following keys:

  • MapReduce: Simplified Data Processing on Large Clusters,
  • Simplified Data Processing on Large Clusters,
  • Data Processing on Large Clusters,
  • Processing on Large Clusters,
  • on Large Clusters,
  • Large Clusters,
  • Clusters.

If the text is too long, you can truncate keys to a given number of characters (say 24).

Here is a code sample for CouchDB:

function map(o) {
  const SIZE = 24;
  function format(text, begin) {
    return text.substr(begin, SIZE).toLowerCase();
  }
  const WORD_MATCHER = /\S+/g;
  while ((match = WORD_MATCHER.exec(o.title))) {
      var begin = match.index;
      emit(format(o.title, begin), {position: begin});
  }
}

Then if you ask for keys between data process and data processZ, you will get:

{"key": "data processing on large clusters", "id": "foo", "value":{"position": 22}}

Upvotes: 3

Related Questions