Lukas

Reputation: 1366

Find similar documents based on a string in MongoDB

I need to find all documents in a MongoDB database that have a property containing a string that is similar to the search term, while allowing for a certain percentage of divergence.

In plain JavaScript I could, for example, use https://www.npmjs.com/package/string-similarity and then match all documents that have a similarity score above 90%.

I'd like to do this as a MongoDB query and have it be as performant as possible, as the database contains millions of documents.

What possible options do I have in this situation?

I'd be grateful for any ideas on how to solve this in the best possible way.

Upvotes: 2

Views: 811

Answers (1)

Tom Slabbaert

Reputation: 22296

The common solution to this problem is to use a search engine database, like Elasticsearch or Atlas Search (by the MongoDB team). I will not go into too much detail on how these databases work, but generally speaking they are built on an inverted index: your data is tokenized on insert, and your queries then run on the tokenized data and not on the raw data set.
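To make the inverted index idea concrete, here is a minimal in-memory sketch. It is only an illustration of the concept, not how Elasticsearch or Atlas Search are actually implemented; the `tokenize` function, the `InvertedIndex` class and the sample documents are all made up for this example.

```javascript
// Split text into lowercase alphanumeric tokens (a deliberately naive tokenizer).
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

class InvertedIndex {
  constructor() {
    // token -> Set of ids of the documents that contain that token
    this.postings = new Map();
  }

  // Tokenize on insert and record the document id under every token.
  add(id, text) {
    for (const token of tokenize(text)) {
      if (!this.postings.has(token)) this.postings.set(token, new Set());
      this.postings.get(token).add(id);
    }
  }

  // Look the query tokens up in the index and intersect their posting sets.
  // The raw documents are never scanned.
  search(query) {
    const tokens = tokenize(query);
    if (tokens.length === 0) return [];
    let result = null;
    for (const token of tokens) {
      const ids = this.postings.get(token) ?? new Set();
      result = result === null
        ? new Set(ids)
        : new Set([...result].filter((id) => ids.has(id)));
    }
    return [...result].sort((a, b) => a - b);
  }
}

// Hypothetical sample data.
const index = new InvertedIndex();
index.add(1, "Find similar documents");
index.add(2, "Similar strings in MongoDB");
index.add(3, "Fuzzy search");

console.log(index.search("similar"));
console.log(index.search("similar mongodb"));
```

Because lookups go through the token map, the cost of a query depends on the number of query tokens and the size of their posting sets rather than on the total size of the collection.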

This approach is very powerful and can help with many "search engine" problems like autocomplete or in your case what is called a "fuzzy" search.
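As a rough sketch of what this could look like with Atlas Search, the snippet below builds an aggregation pipeline that uses the `$search` stage with the `text` operator and its `fuzzy` option. It assumes an Atlas cluster with a search index named `default` covering the queried field; the helper name `buildFuzzySearchPipeline`, the field name `title` and the sample query are hypothetical. Note that `maxEdits` is an edit distance (1 or 2), not a percentage, so it only approximates the "90% similar" requirement.

```javascript
// Builds an aggregation pipeline for a fuzzy Atlas Search query.
// Assumption: a search index named "default" exists and covers `path`.
function buildFuzzySearchPipeline(path, query, maxEdits = 2, limit = 20) {
  if (maxEdits !== 1 && maxEdits !== 2) {
    throw new RangeError("maxEdits must be 1 or 2");
  }
  return [
    {
      $search: {
        index: "default",
        text: { query, path, fuzzy: { maxEdits } },
      },
    },
    { $limit: limit },
    // Expose the relevance score next to the matched field.
    { $project: { [path]: 1, score: { $meta: "searchScore" } } },
  ];
}

// Hypothetical field name and (misspelled) search term.
const pipeline = buildFuzzySearchPipeline("title", "similer documnts", 1);
console.log(JSON.stringify(pipeline, null, 2));
```

The resulting array would be passed to `collection.aggregate(pipeline)` in the Node.js driver. The `searchScore` is a relevance score, not a similarity percentage, so any threshold on it would have to be tuned against real data.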

Let's see how Elasticsearch deals with this by reading the documentation for its fuzzy query:

To find similar terms, the fuzzy query creates a set of all possible variations, or expansions, of the search term within a specified edit distance. The query then returns exact matches for each expansion.

Basically, what they do is generate all possible variations of the search term within the given parameters. I would personally recommend you just use one of these databases, which give you this ability out of the box. However, if you want to build a "pseudo" search engine in Mongo, you can use the same approach yourself, with the downside that Mongo's indexes are trees (B-trees), so these queries force tree lookups instead of running on a database designed for this.
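A minimal sketch of that "pseudo" approach, under some loud assumptions: the field stores a single lowercased value over the alphabet a-z, an ordinary index exists on it, and an edit distance of 1 is enough. The names `expansions`, `buildFuzzyFilter` and the field `name` are made up for the example; only the `$in` operator is MongoDB's.

```javascript
const ALPHABET = "abcdefghijklmnopqrstuvwxyz";

// Returns every string within one edit of `term`:
// deletion, adjacent transposition, insertion or substitution.
function expansions(term) {
  const out = new Set([term]);
  for (let i = 0; i <= term.length; i++) {
    const head = term.slice(0, i);
    const tail = term.slice(i);
    if (tail.length > 0) out.add(head + tail.slice(1)); // delete tail[0]
    if (tail.length > 1) {
      out.add(head + tail[1] + tail[0] + tail.slice(2)); // swap two neighbours
    }
    for (const c of ALPHABET) {
      out.add(head + c + tail); // insert c at position i
      if (tail.length > 0) out.add(head + c + tail.slice(1)); // replace tail[0] with c
    }
  }
  // Drop the empty string so a one-letter term never matches empty fields.
  return [...out].filter((s) => s.length > 0);
}

// Builds a plain MongoDB filter that matches any of the expansions exactly.
function buildFuzzyFilter(field, term) {
  return { [field]: { $in: expansions(term.toLowerCase()) } };
}

// Hypothetical field name and search term.
const filter = buildFuzzyFilter("name", "mongo");
console.log(filter.name.$in.length);
```

The filter can be passed to `collection.find(filter)`. Each expansion becomes an exact lookup in the index, so a five-letter term already produces a few hundred lookups, and the count grows quickly with a larger alphabet or an edit distance of 2. It also only matches the whole field value; matching words inside longer text would require storing tokens separately, which is the point where a real search engine becomes the easier option.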

Upvotes: 1
