Reputation: 17279
I am working on an app that requires searching a large list of titles. Ideally I would like to use NoSQL, but it seems that full-text search across the whole database is not as good as in SQL databases (please correct me if I am wrong).
In any case, I want to optimize search speed. A normal search might be fast enough, but I also want a responsive live search AND fuzzy search. I can only think of two approaches:
Load the whole list of titles into memory, indexed as a trie/prefix tree (see the sketch at the end of this question).
Implement some kind of trie-like index with a map/reduce function. This would be the preferred solution, but I am not sure whether it can be done, or whether the disk space cost would outweigh the benefits.
Any ideas? Also, I am not sure whether a "fuzzy search" is best implemented with a trie or with a B+ tree.
Since the titles are unique, should I just use the full title as the document ID?
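For the first option, this is roughly what I have in mind. Just a sketch, with made-up names (createNode, insert, searchPrefix), and it trades memory for lookup speed by keeping a title list on every node:

// Minimal in-memory prefix tree: each node maps one character to a child
// node and keeps the list of titles whose prefix passes through it.
function createNode() {
  return { children: {}, titles: [] };
}

function insert(root, title) {
  var key = title.toLowerCase();
  var node = root;
  for (var i = 0; i < key.length; i++) {
    var ch = key[i];
    if (!node.children[ch]) node.children[ch] = createNode();
    node = node.children[ch];
    node.titles.push(title);   // every prefix of the title reaches this node
  }
}

function searchPrefix(root, prefix) {
  var key = prefix.toLowerCase();
  var node = root;
  for (var i = 0; i < key.length; i++) {
    node = node.children[key[i]];
    if (!node) return [];      // no title starts with this prefix
  }
  return node.titles;
}

// Usage: live search as the user types.
var root = createNode();
insert(root, "MapReduce: Simplified Data Processing on Large Clusters");
searchPrefix(root, "mapred"); // -> the full title above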
Upvotes: 3
Views: 978
Reputation: 3852
To do this efficiently, you'll have to index your text by words.
In other words, the object foo, entitled MapReduce: Simplified Data Processing on Large Clusters, will be mapped to the following keys:

MapReduce: Simplified Data Processing on Large Clusters
Simplified Data Processing on Large Clusters
Data Processing on Large Clusters
Processing on Large Clusters
on Large Clusters
Large Clusters
Clusters

If the text is too long, you can truncate keys to a given number of characters (say 24).
Here is a code sample for CouchDB:
function map(o) {
  const SIZE = 24;                 // truncate keys to 24 characters
  const WORD_MATCHER = /\S+/g;

  // Key = the lowercased title starting at a given offset, capped at SIZE chars.
  function format(text, begin) {
    return text.substr(begin, SIZE).toLowerCase();
  }

  // Emit one key per word of the title, remembering where that word starts.
  var match;
  while ((match = WORD_MATCHER.exec(o.title))) {
    var begin = match.index;
    emit(format(o.title, begin), {position: begin});
  }
}
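For reference, CouchDB runs a map function like this from a design document. A minimal sketch, assuming a database named titles, a design document _design/search and a view called by_prefix (all three names are mine, pick whatever fits your app):

{
  "_id": "_design/search",
  "views": {
    "by_prefix": {
      "map": "function (o) { /* the map function above, as a string */ }"
    }
  }
}

The map source is stored as a JSON string inside the design document; CouchDB builds the index the first time the view is queried and keeps it up to date incrementally afterwards.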
Then, if you ask for keys between data process and data processZ, you will get:
{"key": "data processing on large clusters", "id": "foo", "value":{"position": 22}}
Upvotes: 3