Lasit Pant

Reputation: 317

Elasticsearch not giving exact results in Python

I am using a match_phrase query to search in Elasticsearch, but I have noticed that the results returned are not appropriate. Code:

res = es.search(index='indice_1',
                body={
                    "_source": ["content"],
                    "query": {
                        "match_phrase": {
                            "content": "xyz abc"
                        }
                    }
                },
                size=500,
                scroll='60s')

It doesn't return records where the content is "hi my name isxyz abc." or "hey wassupxyz abc. how is life".

Doing a similar search in MongoDB using a regex returns both records. Any help would be appreciated.

Upvotes: 3

Views: 132

Answers (2)

Pratik Patel

Reputation: 6978

You can also set the type parameter to phrase in a match query:

res = es.search(index='indice_1',
                body={
                    "_source": ["content"],
                    "query": {
                        "match": {
                            "content": {
                                "query": "xyz abc",
                                "type": "phrase"  # makes the match behave like match_phrase
                            }
                        }
                    }
                },
                size=500,
                scroll='60s')

Upvotes: 0

Tim

Reputation: 1286

If you didn't specify an analyzer, then you are using the standard analyzer by default. It does grammar-based tokenization, so the terms for the phrase "hi my name isxyz abc." will be something like [hi, my, name, isxyz, abc], and match_phrase is looking for the terms [xyz, abc] right next to each other (unless you specify slop).
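
You can verify this with the _analyze API. A quick sketch, assuming the same Python client (es) as in the question:

# Inspect the terms the standard analyzer produces for the document text.
tokens = es.indices.analyze(body={
    "analyzer": "standard",
    "text": "hi my name isxyz abc."
})
print([t["token"] for t in tokens["tokens"]])
# ['hi', 'my', 'name', 'isxyz', 'abc']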

You can either modify your query or use a different analyzer. If you use a plain match query, it will match on the term "abc", but if you want the whole phrase to match, you'll need a different analyzer. NGrams should work for you.

Here's an example:

PUT test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }, 
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

PUT test_index/_doc/1
{
  "content": "hi my name isxyz abc."
}

PUT test_index/_doc/2
{
  "content": "hey wassupxyz abc. how is life"
}

POST test_index/_doc/_search
{
  "query": {
    "match_phrase": {
      "content": "xyz abc"
    }
  }
}

That results in finding both documents.

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "hey wassupxyz abc. how is life"
        }
      },
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "content": "hi my name isxyz abc."
        }
      }
    ]
  }
}
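
For completeness, here is roughly what the same setup and query look like through the Python client from the question (a sketch, assuming a local ES 6.x node; the index name test_index matches the console examples above):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create the index with the 3-gram analyzer defined above.
es.indices.create(index='test_index', body={
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {"tokenizer": "my_tokenizer"}
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                    "token_chars": ["letter", "digit"]
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "content": {"type": "text", "analyzer": "my_analyzer"}
            }
        }
    }
})

# Index the two example documents and make them visible to search.
es.index(index='test_index', doc_type='_doc', id=1,
         body={"content": "hi my name isxyz abc."})
es.index(index='test_index', doc_type='_doc', id=2,
         body={"content": "hey wassupxyz abc. how is life"})
es.indices.refresh(index='test_index')

# The original match_phrase query now finds both documents.
res = es.search(index='test_index', body={
    "_source": ["content"],
    "query": {"match_phrase": {"content": "xyz abc"}}
})
print(res["hits"]["total"])  # 2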

EDIT: If you're looking to do a wildcard query, you can use the standard analyzer. The use case you specified in the comments would be added like this:

PUT test_index/_doc/3
{
  "content": "RegionLasit Pant0Q00B000001KBQ1SAO00"
}

And you can query it with wildcard:

POST test_index/_doc/_search
{
  "query": {
    "wildcard": {
      "content.keyword": {
        "value": "*Lasit Pant*"
      }
    }
  }
}

Essentially you are doing a substring search without the nGram analyzer. Your query phrase will then just be "*<my search terms>*". I would still recommend looking into nGrams.
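
In the Python client, that wildcard query would look something like this (a sketch; note that content.keyword assumes the default dynamic mapping, which adds a keyword sub-field, and that queries with a leading wildcard can be slow on large indices):

# Substring search via a wildcard on the keyword sub-field.
res = es.search(index='test_index', body={
    "query": {
        "wildcard": {
            "content.keyword": {
                "value": "*Lasit Pant*"
            }
        }
    }
})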

Upvotes: 3
