ElasticSearch search list of keywords with possible typos

Question

What is the best approach for searching for some keywords inside a field that contains a large text in an Elasticsearch index?

I have some words that I want to search inside a field named my_field with these constraints:

I can pass the words either as separate elements or together as a single string with a delimiter (such as a space); what matters is that each one is searched.
The words can contain typos or be written in different ways: OpenAI can appear as open ai or openai (in lowercase). I want all of these combinations to be searched, with exact matches ranked first.

Let's make an example. My words are:

cto
open
ai

So I can keep them separate or treat them as a single string, "cto open ai", in Google search style. The words can also be:

cto
openai

because they come from an algorithm that extracts keywords from a text, and it may or may not split a single keyword into 2 "common" words.
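To make the "split or not split" ambiguity concrete, here is a minimal Python sketch that adds the joined form of each adjacent pair of extracted keywords to the list. The function name with_adjacent_joins is hypothetical, and joining neighbours is only an assumed heuristic, not something the extraction algorithm itself does.

```python
def with_adjacent_joins(words):
    """Return the original words plus the concatenation of each adjacent pair."""
    variants = list(words)
    for left, right in zip(words, words[1:]):
        joined = left + right
        # Avoid duplicates when a joined form is already in the list.
        if joined not in variants:
            variants.append(joined)
    return variants


if __name__ == "__main__":
    print(with_adjacent_joins(["cto", "open", "ai"]))
```

This covers the "open" + "ai" to "openai" direction, at the cost of also producing unlikely joins such as "ctoopen". The opposite direction (splitting "openai" into "open" and "ai") would need a dictionary or similar and is not attempted here.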

The document I want as the first result has a my_field that contains a long text with: ".....cto.....open ai...". So I tried a match query, since I read that its fuzziness parameter controls the Levenshtein distance.
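As a rough illustration of what fuzziness does per term, the following Python sketch computes a plain Levenshtein distance and the number of edits that "AUTO" allows under its documented default thresholds (lengths 0 to 2: exact match, 3 to 5: one edit, longer: two edits). The helper names are hypothetical, and the sketch assumes the field is tokenized on whitespace and lowercased; it also ignores transpositions, which Elasticsearch counts as a single edit by default.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def auto_max_edits(term: str) -> int:
    """Edits allowed by "AUTO" with the default AUTO:3,6 thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_hits(term, tokens):
    """Tokens within the AUTO edit budget of the query term."""
    budget = auto_max_edits(term)
    return [t for t in tokens if levenshtein(term, t) <= budget]


if __name__ == "__main__":
    tokens = ["cto", "open", "ai"]  # assumed tokens of ".....cto.....open ai..."
    for term in ["cto", "open", "ai", "openai"]:
        print(term, auto_max_edits(term), fuzzy_hits(term, tokens))
```

Under these assumptions, "openai" is two edits away from the token "open" and four edits away from "ai", so a fuzzy match compares it against each indexed token separately and never against the two-token sequence "open ai".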

With these 2 queries the result is found:

Query ok 1 (fuzziness 0 with 3 terms):✅

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "0" }}}, 
        { "match": { "my_field": { "query": "open", "fuzziness": "0"  }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "0"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

Query ok 2 (fuzziness 0 with 1 string):✅

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto open ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

(even if I change the order of the words in the query).
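For reference, both request bodies above can be produced from either input form (a list of words or a single delimited string) with a small helper. This is a minimal Python sketch: the function name build_keyword_query and its parameters are hypothetical, and it only builds the request body as a dict without sending anything to Elasticsearch.

```python
import json


def build_keyword_query(keywords, field="my_field", fuzziness="0",
                        split=True, delimiter=" "):
    """Build a bool/should body with one match clause per word (split=True)
    or a single match clause holding the whole string (split=False)."""
    if isinstance(keywords, str):
        words = [w for w in keywords.split(delimiter) if w]
    else:
        words = [w.strip() for w in keywords if w.strip()]

    queries = words if split else [delimiter.join(words)]
    should = [
        {"match": {field: {"query": q, "fuzziness": fuzziness}}}
        for q in queries
    ]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


if __name__ == "__main__":
    # Same shape as query 1 (three clauses) and query 2 (one clause).
    print(json.dumps(build_keyword_query(["cto", "open", "ai"]), indent=2))
    print(json.dumps(build_keyword_query("cto open ai", split=False), indent=2))
```

Passing fuzziness="AUTO" gives bodies of the same shape with fuzzy matching enabled on every clause.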

But I want to find the same result even if:

the text contains open ai
my query has openai, since that is only a small change/typo.

So I tried with:

Query error 3 (fuzziness AUTO with 2 terms and typo):❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

But it returns other results before the one I want. Strangely, if I use the same query as in case 1 but with AUTO in place of 0, it also ranks other documents first, some of which may contain only 1 of the 3 words in my_field rather than all 3. I know that one document contains all 3 words exactly, so I don't understand why it is not prioritized:

Query error 4 (fuzziness AUTO with the 3 original terms that worked before with 0):❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "open", "fuzziness": "AUTO"  }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "AUTO"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

I also tried a mixed approach, giving a boost to the match clauses that do not set "fuzziness": "AUTO", but with no luck:

Query error 5 (mixed fuzziness with 2 terms and typo):❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "boost": 10 }}}, 
        { "match": { "my_field": { "query": "openai", "boost": 10  }}},
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

So how can I write a query that tolerates all of these typos/little changes and still ranks first the documents that contain the possible combinations exactly?
