Reputation: 1531
I want to be able to return a list of the closest matching names from my mongo database given a string. I want to do this as efficiently as possible. To illustrate, my documents look like:
const personSchema = new Schema({
name: { type: String, required: true }
...more fields
});
And the input might look like: Barcck Obmaa
which I would expect to return a list of people with the person who's name is Barack Obama
at the top.
The algorithm should account for the fact that a string that is a prefix of a name matches that name better than a string that is otherwise equally close to the name but not a prefix.
There are a bunch of algorithms that make use of a precomputed index to make this kind of search faster. Two that have caught my eye are the Pass-Join Index and the BKTree which use algorithms like Levenshtein or Jaro-Winkler. It seems to me that there should be some way to integrate these techniques into a mongo database, but there doesn't seem to be any established way of doing so.
The best solution I could find is an n-gram based approach described in this article. Is this the best option I have?
Upvotes: 1
Views: 1229
Reputation: 3719
Years ago I worked on digitizing a medical book, and found that using a modified version of the soundex system ( https://www.archives.gov/research/census/soundex ) as a means to perform a fuzzy search worked great, with a minor modification of using the soundex number for the entire word.
For example, the actual soundex for "Washington" is...
...whereas for the medical book, it would have been encoded as...
I also ignored p's when in combination with "pf" or "ps", plus a few other minor rules I added on top of the official soundex, that I can't now remember. So, this modified soundex system allowed the doctors to totally butcher the spelling of a word when performing a search in the medical text. For example, the following two words will have the same soundex...
...so searching for "sikawligy" will return results with "psychology".
Not sure this exactly answers your question, as it appears that you're looking for close matches to someone fat fingering the spelling, whereas what I'm proposing is a means of searching based on misspelled words that sound the same as the actual word being sought...
In any event, hope this helps.
Upvotes: 1