baiduXiu

Reputation: 167

Issue with edge_ngram tokenizer in Elasticsearch

I am using an edge_ngram tokenizer to provide partial matching. My documents look like this:

Name
Labson series LTD 2014
Labson PLO LTD 2014A
Labson PLO LTD 2014-I
Labson PLO LTD. 2014-II

My mapping is as follows:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 40,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}
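As a sanity check, the autocomplete analyzer can be inspected with the index-level _analyze API (a quick sketch, assuming the index above has been created):

POST my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "2014A"
}

If I read the edge_ngram settings right, this should return the tokens 20, 201, 2014 and 2014a.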

PUT my_index/doc/1
{
  "title": "Labson Series LTD 2014" 
}

PUT my_index/doc/2
{
  "title": "Labson PLO LTD 2014A" 
}


PUT my_index/doc/3
{
  "title": "Labson PLO LTD 2014-I" 
}


PUT my_index/doc/4
{
  "title": "Labson PLO LTD. 2014-II" 
}

The following query gives me 3 documents, which is correct (Labson PLO LTD 2014A, Labson PLO LTD 2014-I, Labson PLO LTD. 2014-II):

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "labson plo", 
        "operator": "and"
      }
    }
  }
}

But when I type in Labson PLO 2014A, it gives me 0 documents:

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Labson PLO 2014A", 
        "operator": "and"
      }
    }
  }
}

I expect this to return the single document Labson PLO LTD 2014A, but for some reason it seems like the digits are not being indexed in the tokens. Let me know if I am missing anything here.

Upvotes: 0

Views: 261

Answers (1)

root

Reputation: 3937

In your autocomplete_search analyzer you are using the lowercase tokenizer, which performs the function of the letter tokenizer and the lowercase token filter together:

https://www.elastic.co/guide/en/elasticsearch/reference/2.3//analysis-lowercase-tokenizer.html

Now let's see what the letter tokenizer does:

The letter tokenizer breaks text into terms whenever it encounters a character which is not a letter.

https://www.elastic.co/guide/en/elasticsearch/reference/master/analysis-letter-tokenizer.html
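You can check this behaviour directly against the built-in lowercase tokenizer (the token list I mention below is what I expect based on the docs above):

POST _analyze
{
  "tokenizer": "lowercase",
  "text": "Labson PLO 2014A"
}

This should return only labson, plo and a; the digits are stripped before your query is ever built.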

So in your case, when you queried

"query": "Labson PLO 2014A",

the query actually became

"+title:labson +title:plo +title:a"

since the letter tokenizer dropped 2014, leaving only the trailing a. Your index tokens do not contain a token with just the letter a (your min_gram is 2, so single-character tokens are never indexed), which is why you did not get any results back.

You can analyze your query like this in Kibana:

POST my_index/_validate/query?explain
{
  "query": {
    "match": {
      "title": {
        "query": "Labson PLO 2014a", 
        "operator": "and"
      }
    }
  }
}

and you will see that 2014 is dropped from the final query.

Also, to see what the letter tokenizer produces, use the following query:

POST _analyze
{
  "tokenizer": "letter",
  "text": "Labson PLO LTD 2014a"
}
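If you want the digits to survive at search time, one possible direction (a sketch only, not tested against your exact version) is to define autocomplete_search as a custom analyzer that pairs the standard tokenizer with the lowercase token filter, instead of using the lowercase tokenizer:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_search": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}

With this, Labson PLO 2014A should be analyzed into labson, plo and 2014a, and 2014a can match the edge n-grams indexed from 2014A. Note that changing analysis settings requires recreating the index (or closing it, updating the settings and reopening it), and hyphenated suffixes like 2014-I would still need care, since i alone is shorter than your min_gram of 2.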

Upvotes: 1
