User54211

Reputation: 121

Enable autocomplete querying in ElasticSearch

I am trying to build an ElasticSearch index whose documents hold product names, for example laptops:

{ "name" : "Laptop Blue I7"}

I then want to use the index for autocomplete suggestions by querying it. I have 2 main constraints:

  1. There can be Synonyms of the name -

I want to define synonyms for terms, like "Notebook" for "Laptop". The ingested documents can be of the following kind -

"Laptop Blue I7"
"Laptop Blue I7"
"Laptop Blue I7"
"Laptop Blue I7"
"Laptop Red I7"
"Laptop Red I7"
"Notebook Blue I7"

I am using the following settings and mapping when creating the index -

{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": ["Laptop,Notebook"]
          }
        },
        "analyzer": {
          "synonym": {
            "tokenizer": "keyword",
            "filter": ["synonym"]
          }
        }
      }
    }
  },
  "mappings": {
    "catalog": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "synonym"
        }
      }
    }
  }
}

  2. Querying -

When I query the data with "Notebook", I would like the results to include synonym matches and to be ordered by how frequently each name occurs. However, the response I actually get takes neither the synonym nor the frequency into account. I use the following query -

/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Notebook"
    }
  }
}

The response I get is -

"Notebook Blue I7"

I would instead expect the response to be either of the following -

"Laptop Blue I7"
"Laptop Red I7"

or

"Notebook Blue I7"
"Laptop Blue I7"
"Laptop Red I7"

Any insights into resolving this would be helpful. Thanks.

======== Edit 1:

When I use _analyze on "Notebook" the response is

{'tokens': [{'end_offset': 3,
             'position': 0,
             'start_offset': 0,
             'token': 'Notebook',
             'type': '<ALPHANUM>'},
            {'end_offset': 3,
             'position': 0,
             'start_offset': 0,
             'token': 'Laptop',
             'type': 'SYNONYM'}]}

Upvotes: 1

Views: 790

Answers (2)

Nishant

Reputation: 7864

As Amit mentioned, edge n-grams are what you should consider for implementing autocomplete. I would like to explain why the settings you used did not give the expected result even for the complete word Notebook. To see why, let's look at how the analyzer above works.

The synonym analyzer defined in the settings has two components: a tokenizer and a token filter. For an input string, the tokenizer is applied first and its output is one or more tokens. These tokens are then the input to the token filter, which produces the final set of tokens.

You can read more about how analyzers work here.

Now let's consider the first example, Laptop Blue I7.

For this input string, the keyword tokenizer is applied first. As you may know, the keyword tokenizer takes the input string and generates a single token that is the same string without any modification, so the output of the tokenizer is Laptop Blue I7 as a single token. This token is then the input to the synonym token filter. According to the definition, Laptop and Notebook are synonyms, but neither of them equals the token Laptop Blue I7, so the filter does nothing and passes the token on as it is. The final token generated is therefore Laptop Blue I7.

So when you search for Notebook, the query tokens (Notebook and Laptop) do not match a document whose only indexed token is Laptop Blue I7.

Note that if the input string is just Laptop or Notebook, you will get the expected tokens, because the keyword tokenizer generates a single-word token for that input. This is why _analyze on "Notebook" gives you the expected result.
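The behaviour described above can be reproduced with a small Python simulation. This is only a sketch: keyword_tokenizer, synonym_filter, analyze and matches are simplified stand-ins written for illustration, not Elasticsearch APIs, and matching is reduced to "query and document share at least one token".

```python
# Simplified stand-ins for the analysis chain in the question (illustrative only).
SYNONYMS = {"Laptop": ["Notebook"], "Notebook": ["Laptop"]}


def keyword_tokenizer(text):
    # Emits the whole input as a single, unmodified token.
    return [text]


def synonym_filter(tokens):
    # Adds synonyms only when an entire token equals a synonym entry.
    out = []
    for tok in tokens:
        out.append(tok)
        out.extend(SYNONYMS.get(tok, []))
    return out


def analyze(text):
    # Tokenizer first, then the token filter, as in the "synonym" analyzer.
    return synonym_filter(keyword_tokenizer(text))


def matches(query, doc):
    # Assumption: a document matches when it shares a token with the query.
    return bool(set(analyze(query)) & set(analyze(doc)))


print(analyze("Laptop Blue I7"))              # ['Laptop Blue I7']
print(analyze("Notebook"))                    # ['Notebook', 'Laptop']
print(matches("Notebook", "Laptop Blue I7"))  # False
```

The single token Laptop Blue I7 never equals Laptop or Notebook, so the synonym lookup finds nothing for the document, while the one-word query does get expanded.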

So the conclusion is that the keyword tokenizer is the culprit here. To solve this we need a tokenizer that generates separate tokens for Laptop, Blue, and I7. The easiest way to do this is to use the standard tokenizer instead of keyword.
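A minimal sketch of what the adjusted index body might look like, written as a Python dict so it can be printed as JSON. The only change the answer calls for is the tokenizer; the lowercase filter and the lowercased synonym list are my own assumptions, added so that matching does not depend on case. The catalog mapping type is copied from the question and only applies to Elasticsearch versions that still support mapping types.

```python
import json

# Index body from the question with the tokenizer switched to "standard".
index_body = {
    "settings": {
        "index": {
            "analysis": {
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        # Assumption: lowercased to pair with the lowercase filter.
                        "synonyms": ["laptop,notebook"],
                    }
                },
                "analyzer": {
                    "synonym": {
                        # Splits "Laptop Blue I7" into separate word tokens.
                        "tokenizer": "standard",
                        # Assumption: "lowercase" is not in the original settings.
                        "filter": ["lowercase", "synonym"],
                    }
                },
            }
        }
    },
    "mappings": {
        "catalog": {
            "properties": {
                "name": {"type": "text", "analyzer": "synonym"}
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The existing index would need to be recreated and the documents reindexed for the new analyzer to take effect, since analysis is applied at index time.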

Dealing with multi-word synonyms

This answer might help you.

Upvotes: 1

Amit

Reputation: 32376

The issue is with the keyword tokenizer that you have used in your synonym analyzer. Please do the following to debug it.

  1. Check the tokens generated for your matched and unmatched documents using the analyze API.

  2. Use the explain API to understand how the tokens are generated and how they are matched against your inverted index.

If the tokens generated for your documents in the inverted index match the tokens generated from your search term, Elasticsearch reports a match. The explain output also gives other information, such as how many documents in a shard matched the search term and how the score was computed.
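As a rough sketch, the two debugging requests could be built like this. The index name products and the document id 1 are hypothetical placeholders, and the code only constructs the requests; it does not send them. The _explain path shown is the type-less form; older versions that use mapping types (as the catalog mapping in the question suggests) expect /{index}/{type}/{id}/_explain instead.

```python
import json

INDEX = "products"  # hypothetical index name
DOC_ID = "1"        # hypothetical document id

# _analyze: list the tokens the "synonym" analyzer produces for a text.
analyze_request = {
    "method": "GET",
    "path": f"/{INDEX}/_analyze",
    "body": {"analyzer": "synonym", "text": "Laptop Blue I7"},
}

# _explain: report whether one document matches the query and how it scores.
# The query is the one from the question.
explain_request = {
    "method": "GET",
    "path": f"/{INDEX}/_explain/{DOC_ID}",
    "body": {
        "query": {
            "query_string": {"default_field": "name", "query": "Notebook"}
        }
    },
}

for request in (analyze_request, explain_request):
    print(request["method"], request["path"])
    print(json.dumps(request["body"], indent=2))
```

Comparing the _analyze output for a document name with the output for the search term shows directly whether the two sides share any token.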

The above are just basic steps to troubleshoot your issue. Beyond that, you have not yet implemented a proper autocomplete search, which should also return results for partial input such as note and lapt in your case. To implement this you need to use an edge n-gram analyzer, and this official ES post can help you implement it.
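To illustrate why edge n-grams make prefix queries such as lapt work, here is a small Python simulation. It is a sketch under assumptions: edge_ngrams, index_terms and prefix_matches are hypothetical helpers, whitespace splitting plus lowercasing stands in for the real tokenizer and filters, and the query is assumed to be analyzed without n-grams.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    # All prefixes of the token whose length lies in [min_gram, max_gram].
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(name):
    # Index-time analysis: lowercase, split on whitespace, expand each word.
    terms = set()
    for word in name.lower().split():
        terms.update(edge_ngrams(word))
    return terms


def prefix_matches(query, name):
    # Search-time analysis: the lowercased query is used as a single term.
    return query.lower() in index_terms(name)


print(edge_ngrams("laptop"))
# ['l', 'la', 'lap', 'lapt', 'lapto', 'laptop']
print(prefix_matches("lapt", "Laptop Blue I7"))    # True
print(prefix_matches("note", "Notebook Blue I7"))  # True
print(prefix_matches("note", "Laptop Blue I7"))    # False
```

Because every prefix is stored at index time, a partial word typed by the user is an exact term lookup. The last line shows that n-grams alone do not bring in synonyms; that part still depends on the synonym handling discussed above.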

Let me know if you face any other issues or need any clarification.


Upvotes: 1
