Tokenizing string for completion suggester

Question

want to build the auto complete functionality of an e-commerce website, using Completion Suggester.

This is my Index:

PUT myIndex
{
    "mappings": {
        "_doc" : {
            "properties" : {
                "suggest" : {
                    "type" : "completion"
                },
                "title" : {
                    "type": "keyword"
                }, 
                "category" : { 
                    "type": "keyword"
                },
                "description" : { 
                    "type": "keyword"
                }
            }
        }
    }
}

Now, when uploading the advertisement I want the title field to be used for auto complete, so this is how I upload a document:

POST dummy/_doc
{
  "title": "Blue asics running shoes",
  "category": "sports",
  "description": "Nice blue running shoes, size 44 eu",
  "suggest": {
    "input": "Blue Asics running shoes" // <-- use title
  }
}

Problem is, this way, elastic search only matches the string from beginning... i.e. "Blu" will find result but "Asic" or "Run" or "Sho" won't return anything...

So what I need to do is to tokenize my input like this:

POST dummy/_doc
{
  "title": "Blue asics running shoes",
  "category": "sports",
  "description": "Nice blue running shoes, size 44 eu",
  "suggest": {
    "input": ["Blue", "Asics", "running", "shoes"] // <-- tokenized title
  }
}

This would work fine... But how am I supposed to tokenize my field? I know I can split the string in c#, but is there anyway that I can do this in Elasticsearch/Nest?

Russ Cam · Accepted Answer

Completion suggester is designed for fast search-as-you-type prefix queries, using a simple analyzer, and not the standard analyzer which is default for text datatypes.

If you need partial prefix matching on any tokens in the title and not just from the beginning of the title, you may want to consider taking one of these approaches:

use Analyze API with an analyzer that will tokenize the title into tokens/terms from which you would want to partial prefix match, and index this collection as the input to the completion field. The Standard analyzer may be a good one to start with.

Bear in mind that the data structure for completion suggester is held in memory whilst in use, so high terms cardinality across documents will increase the memory demands of this data structure. Also consider that "scoring" of matching terms is simple in that it is controlled by the weight applied to each input.

or

Don't use the Completion Suggester here and instead set up the title field as a text datatype with multi-fields that include the different ways that title should be analyzed (or not analyzed, with a keyword sub field for example).

Spend some time with the Analyze API to build an analyzer that will allow for partial prefix of terms anywhere in the title. As a start, something like the Standard tokenizer, Lowercase token filter, Edgengram token filter and possibly Stop token filter would get you running. Also note that you would want a Search analyzer that does something similar to the Index analyzer except Edgengram token filter, as tokens in the search input would not need to be ngrammed.

Tokenizing string for completion suggester

Answers (2)

Related Questions