Paolo Magnani

Reputation: 699

Elasticsearch: search a list of keywords with possible typos

What is the best approach for searching for keywords inside a field that contains a long text in an Elasticsearch index?

I have some words that I want to search inside a field named my_field with these constraints:

Let's make an example. My words are:

So I can keep them separate or treat them as a single string "cto open ai", Google-search style. The words can also be:

because they come from an algorithm that extracts keywords from a text, and it may or may not split a single keyword into 2 "common" words.

The document I want as the first result has a my_field that contains a long text with: ".....cto.....open ai...". So I tried a match query, since I read that the fuzziness parameter controls the allowed Levenshtein distance.
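For reference, here is a minimal Python sketch of what that distance means and how many edits "fuzziness": "AUTO" allows per term under the documented defaults (terms of 0-2 characters must match exactly, 3-5 characters allow 1 edit, longer terms allow 2). The function names are hypothetical, and the sketch uses plain Levenshtein distance, whereas Elasticsearch by default also counts a transposition of two adjacent characters as a single edit.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def auto_max_edits(term: str) -> int:
    """Edits allowed by "fuzziness": "AUTO" with its default thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_matches(query_term: str, indexed_term: str) -> bool:
    """Hypothetical helper: would AUTO accept this indexed term?"""
    return levenshtein(query_term, indexed_term) <= auto_max_edits(query_term)


for query_term, indexed_term in [("openai", "open"), ("openai", "opened"),
                                 ("cto", "ceo"), ("cto", "to"), ("ai", "a")]:
    print(query_term, indexed_term, fuzzy_matches(query_term, indexed_term))
```

Under these assumptions a 3-letter term such as "cto" also accepts "ceo", "cfo" or "to", and "openai" accepts "open" as well as "opened", so AUTO can pull in many documents that do not contain the intended words.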

With these 2 queries the result is found:

Query ok 1 (fuzziness 0 with 3 terms):✅

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "0" }}}, 
        { "match": { "my_field": { "query": "open", "fuzziness": "0"  }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "0"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

Query ok 2 (fuzziness 0 with 1 string):✅

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto open ai", "fuzziness": "0" }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

(even if I change the order of the words in the query).

But I want to find the same result even if:

So I tried with:

Query error 3 (fuzziness AUTO with 2 terms and typo):❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

But it returns other documents before the one I want. The strange thing is that if I use the same query as in case 1, but with AUTO in place of 0, it also ranks other documents first, some of which may contain only 1 of the 3 words in my_field rather than all 3. I know that one document contains all 3 words exactly, so I don't understand why it is not prioritized:

Query error 4 (fuzziness AUTO with the 3 original terms that worked before with 0):❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "open", "fuzziness": "AUTO"  }}},
        { "match": { "my_field": { "query": "ai", "fuzziness": "AUTO"  }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

I also tried a mixed approach, giving a boost to the match clauses without "fuzziness": "AUTO", but with no luck:

Query error 5 (mixed fuzziness with 2 terms and typo):❌

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "my_field": { "query": "cto", "boost": 10 }}}, 
        { "match": { "my_field": { "query": "openai", "boost": 10  }}},
        { "match": { "my_field": { "query": "cto", "fuzziness": "AUTO" }}}, 
        { "match": { "my_field": { "query": "openai", "fuzziness": "AUTO" }}}
      ],
      "minimum_should_match" : 1
    }
  }
}

So how can I write a query that tolerates all of these typos/little changes and still ranks first the documents that contain the possible combinations exactly?

Upvotes: 4

Views: 120

Answers (1)

imotov

Reputation: 30153

I would index my_field twice: once as is, and a second time where I first split words on case changes and then combine adjacent words into bigrams using the shingle filter. At search time, I would query both the original field and the bigrams field, giving the original field a higher boost.
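To make the idea concrete, the analysis chain described above can be roughly approximated in Python. This is only a sketch: `split_on_case` and `analyze` are hypothetical helpers, a whitespace split stands in for the standard tokenizer, punctuation is not handled, and the real word_delimiter filter has more rules than a lower-to-upper case change.

```python
import re


def split_on_case(token):
    """Rough stand-in for word_delimiter: split at lower-to-upper changes."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


def analyze(text, output_unigrams):
    """Approximate: tokenize, split on case, lowercase, then 2-word shingles."""
    # Whitespace split is an assumption standing in for the standard tokenizer.
    words = [part.lower()
             for token in text.split()
             for part in split_on_case(token)]
    out = []
    for i, word in enumerate(words):
        if output_unigrams:
            out.append(word)
        if i + 1 < len(words):
            # An empty token_separator glues the two words together.
            out.append(word + words[i + 1])
    return out


# Index side (no unigrams): ['ctoof', 'ofopen', 'openai']
print(analyze("CTO of OpenAI", output_unigrams=False))
# Search side (unigrams kept): ['open', 'openai', 'ai'] for both spellings
print(analyze("Open AI", output_unigrams=True))
print(analyze("OpenAI", output_unigrams=True))
```

In this approximation "OpenAI" and "Open AI" produce the same search tokens, which is why a single bigram field can match either spelling.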

There are different ways of doing this depending on how many mingled-together words you want to match, the boost level, etc., but hopefully this example will get you started:

DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "tuples_index": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false,
          "token_separator": ""
        },
        "tuples_search": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true,
          "token_separator": ""
        }
      }, 
      "analyzer": {
        "standard_shingle_index": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_index" ]
        },
        "standard_shingle_search": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter", "lowercase", "tuples_search" ]
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "fields": {
          "tuples": {
            "type": "text",
            "analyzer": "standard_shingle_index",
            "search_analyzer": "standard_shingle_search"
          }
        }
      }
    }
  }
}

PUT my_index/_bulk?refresh
{"index": {}}
{"my_field": "Mira Murati (born 1988) is a United States-based, Albanian-born engineer, researcher and business executive. She is currently the chief technology officer of OpenAI, the artificial intelligence research company that develops ChatGPT." }
{"index": {}}
{"my_field": "Women You Should Know: Mira Murati, CTO of Open A.I." }

GET my_index/_validate/query?explain

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "OpenAI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "OpenAI"
            }
          }
        }
      ]
    }
  }
}

GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "Open AI",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "my_field.tuples": {
              "query": "Open AI"
            }
          }
        }
      ]
    }
  }
}
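Since the keywords come from an extraction algorithm, the request body can be generated instead of written by hand. A minimal Python sketch, assuming the field names from the mapping above; `build_query` and `exact_boost` are hypothetical names, and the function only builds the JSON body without sending it to a cluster.

```python
import json


def build_query(keywords, exact_boost=2.0):
    """Build a bool/should body that queries both the original field and
    the bigram sub-field, boosting the original field (hypothetical helper)."""
    should = [
        {"match": {"my_field": {"query": keywords, "boost": exact_boost}}},
        {"match": {"my_field.tuples": {"query": keywords}}},
    ]
    return {"query": {"bool": {"should": should}}}


# The same function serves both spellings of the extracted keywords.
for keywords in ("cto open ai", "cto openai"):
    print(json.dumps(build_query(keywords), indent=2))
```

The printed body can be pasted after GET my_index/_search or passed to whichever client you use.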

Upvotes: 0
