Lasit Pant

Reputation: 317

Elasticsearch not giving exact results in Python

I am using a match_phrase query to search in Elasticsearch, but I have noticed that the results returned are not appropriate. Code:

res = es.search(index='indice_1',
                body={
                    "_source": ["content"],
                    "query": {
                        "match_phrase": {
                            "content": "xyz abc"
                        }
                    }
                },
                size=500,
                scroll='60s')

It doesn't return records where the content is "hi my name isxyz abc." or "hey wassupxyz abc. how is life".

Doing a similar search in MongoDB using a regex returns both records. Any help would be appreciated.

Upvotes: 3

Views: 132

Answers (2)

Pratik Patel

Reputation: 6978

You can also set the type parameter to phrase in a match query:

res = es.search(index='indice_1',
                body={
                    "_source": ["content"],
                    "query": {
                        "match": {
                            "content": {
                                "query": "xyz abc",
                                "type": "phrase"  # makes the match behave like match_phrase
                            }
                        }
                    }
                },
                size=500,
                scroll='60s')

Upvotes: 0

Tim

Reputation: 1286

If you didn't specify an analyzer, then you are using the standard analyzer by default. It does grammar-based tokenization, so the terms for the phrase "hi my name isxyz abc." will be something like [hi, my, name, isxyz, abc], and match_phrase is looking for the terms [xyz, abc] right next to each other (unless you specify slop).
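
You can verify this with the _analyze API. A quick sketch, assuming the same Python client (es) as in the question:

# Inspect the terms the standard analyzer produces for the document text.
tokens = es.indices.analyze(body={
    "analyzer": "standard",
    "text": "hi my name isxyz abc."
})
print([t["token"] for t in tokens["tokens"]])
# ['hi', 'my', 'name', 'isxyz', 'abc']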

You can either modify your query or use a different analyzer. If you use a plain match query, it will match on the term "abc", but if you want the whole phrase to match, you'll need a different analyzer. NGrams should work for you.

Here's an example:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }, 
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

PUT test_index/_doc/1
{
  "content": "hi my name isxyz abc."
}

PUT test_index/_doc/2
{
  "content": "hey wassupxyz abc. how is life"
}

POST test_index/_doc/_search
{
  "query": {
    "match_phrase": {
      "content": "xyz abc"
    }
  }
}

That results in finding both documents.

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "hey wassupxyz abc. how is life"
        }
      },
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "content": "hi my name isxyz abc."
        }
      }
    ]
  }
}
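
For completeness, here is roughly what the same setup and query look like through the Python client from the question (a sketch, assuming a local ES 6.x node; the index name test_index matches the console examples above):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create the index with the 3-gram analyzer defined above.
es.indices.create(index='test_index', body={
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {"tokenizer": "my_tokenizer"}
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit"]
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "content": {"type": "text", "analyzer": "my_analyzer"}
            }
        }
    }
})

# Index the two example documents and make them visible to search.
es.index(index='test_index', doc_type='_doc', id=1,
         body={"content": "hi my name isxyz abc."})
es.index(index='test_index', doc_type='_doc', id=2,
         body={"content": "hey wassupxyz abc. how is life"})
es.indices.refresh(index='test_index')

# The original match_phrase query now finds both documents.
res = es.search(index='test_index', body={
    "_source": ["content"],
    "query": {"match_phrase": {"content": "xyz abc"}}
})
print(res["hits"]["total"])  # 2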

EDIT: If you're looking to do a wildcard query, you can use the standard analyzer. The use case you specified in the comments would be added like this:

PUT test_index/_doc/3
{
  "content": "RegionLasit Pant0Q00B000001KBQ1SAO00"
}

And you can query it with wildcard:

POST test_index/_doc/_search
{
  "query": {
    "wildcard": {
      "content.keyword": {
        "value": "*Lasit Pant*"
      }
    }
  }
}

Essentially you are doing a substring search without the nGram analyzer. Your query phrase will then just be "*<my search terms>*". I would still recommend looking into nGrams.
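
In the Python client, that wildcard query would look something like this (a sketch; note that content.keyword assumes the default dynamic mapping, which adds a keyword sub-field, and that queries with a leading wildcard can be slow on large indices):

# Substring search via a wildcard on the keyword sub-field.
res = es.search(index='test_index', body={
    "query": {
        "wildcard": {
            "content.keyword": {
                "value": "*Lasit Pant*"
            }
        }
    }
})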

Upvotes: 3
