Hooman Bahreini

Reputation: 15559

Tokenizing string for completion suggester

I want to build the autocomplete functionality of an e-commerce website using the Completion Suggester.

This is my Index:

PUT myIndex
{
    "mappings": {
        "_doc" : {
            "properties" : {
                "suggest" : {
                    "type" : "completion"
                },
                "title" : {
                    "type": "keyword"
                }, 
                "category" : { 
                    "type": "keyword"
                },
                "description" : { 
                    "type": "keyword"
                }
            }
        }
    }
}

Now, when uploading an advertisement, I want the title field to be used for autocomplete, so this is how I upload a document:

POST myIndex/_doc
{
  "title": "Blue asics running shoes",
  "category": "sports",
  "description": "Nice blue running shoes, size 44 eu",
  "suggest": {
    "input": "Blue Asics running shoes" // <-- use title
  }
}

The problem is that this way Elasticsearch only matches the string from the beginning... i.e. "Blu" will return a result, but "Asic", "Run" or "Sho" won't return anything...
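
For context, this is roughly the suggest query I am running (the suggester name title_suggest is just a placeholder):

POST myIndex/_search
{
  "suggest": {
    "title_suggest": {
      "prefix": "Blu",
      "completion": {
        "field": "suggest"
      }
    }
  }
}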

So what I need to do is to tokenize my input like this:

POST myIndex/_doc
{
  "title": "Blue asics running shoes",
  "category": "sports",
  "description": "Nice blue running shoes, size 44 eu",
  "suggest": {
    "input": ["Blue", "Asics", "running", "shoes"] // <-- tokenized title
  }
}

This would work fine... But how am I supposed to tokenize my field? I know I can split the string in C#, but is there any way that I can do this in Elasticsearch/NEST?

Upvotes: 3

Views: 2403

Answers (2)

Hooman Bahreini

Reputation: 15559

Based on Russ Cam's answer below (option 2), this Elasticsearch guide and also this document, I ended up with the following solution:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_token_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        "additional_stop_words": {
          "type":       "stop",
          "stopwords":  ["your"]
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "C# => csharp",
            "c# => csharp"
          ]
        }
      },
      "analyzer": {
        "result_suggester_analyzer": { 
          "type": "custom",
          "tokenizer": "standard",
          "char_filter":  [ "html_strip", "my_char_filter" ],
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "asciifolding",
            "stop",
            "additional_stop_words",
            "english_stemmer",
            "edge_ngram_token_filter",
            "unique"
          ]
        }
      }
    }
  }
}

Query to test this solution:

POST my_index/_analyze
{
  "analyzer": "result_suggester_analyzer",
  "text": "C# &amp; SQL are great languages. K2 is the mountaineer's mountain. Your house-décor is à la Mode"
}

I would get these tokens (NGrams):

cs, csh, csha, cshar, csharp, sq, sql, gr, gre, grea, great, la, lan, lang,
langu, langua, languag, k2, mo, mou, moun, mount, mounta, mountai, mountain, 
ho, hou, hous, hous, de, dec, deco, decor, mod, mode

Things to note here:

  1. I am using the stop filter, which defaults to the English stop word list and blocks are, is, the - but not your.
  2. I have defined additional_stop_words, which stops your.
  3. I am using the built-in english & possessive_english stemmers, which index word stems: that's why we have the token languag but not language or languages... also note that we have mountain but not mountaineer.
  4. I have defined mapped_words_char_filter, which converts C# to csharp; without it, c# would not be a valid token... (this mapping does not handle F#)
  5. I am using the built-in html_strip char filter, which converts &amp; to &; the resulting & is then dropped because our min_gram is 2.
  6. We are using the built-in asciifolding token filter, which is why décor is tokenized as decor.
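
For completeness, this is roughly how the analyzer could be attached to the title field. The standard search analyzer here is only a stand-in: a better search analyzer would mirror result_suggester_analyzer without the edge_ngram_token_filter, so that the search input is not n-grammed.

PUT my_index/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "result_suggester_analyzer",
      "search_analyzer": "standard"
    }
  }
}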

This is the NEST code for the analysis settings above:

var createIndexResponse = ElasticClient.CreateIndex(IndexName, c => c
    .Settings(st => st
        .Analysis(an => an
            .Analyzers(anz => anz
                .Custom("result_suggester_analyzer", cc => cc
                    .Tokenizer("standard")
                    .CharFilters("html_strip", "mapped_words_char_filter")
                    .Filters(new string[] { "english_possessive_stemmer", "lowercase", "asciifolding", "stop", "additional_stop_words", "english_stemmer", "edge_ngram_token_filter", "unique" })
                )
            )
            .CharFilters(cf => cf
                .Mapping("mapped_words_char_filter", md => md
                    .Mappings(
                        "C# => csharp",
                        "c# => csharp"
                    )
                )
            )
            .TokenFilters(tfd => tfd
                .EdgeNGram("edge_ngram_token_filter", engd => engd
                    .MinGram(2)
                    .MaxGram(10)
                )
                .Stop("additional_stop_word", sfd => sfd.StopWords(new string[] { "your" }))
                .Stemmer("english_stemmer", esd => esd.Language("english"))
                .Stemmer("english_possessive_stemmer", epsd => epsd.Language("possessive_english"))
            )
        )
    )
    .Mappings(m => m.Map<AdDocument>(d => d.AutoMap())));
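
To sanity check the end result, a plain match query against the title field (assuming it is mapped with result_suggester_analyzer as sketched above) should return the document for mid-word prefixes such as asic or runn:

POST my_index/_search
{
  "query": {
    "match": {
      "title": "asic"
    }
  }
}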

Upvotes: 1

Russ Cam

Reputation: 125488

The Completion Suggester is designed for fast search-as-you-type prefix queries, using the simple analyzer by default, and not the standard analyzer, which is the default for text datatypes.

If you need partial prefix matching on any tokens in the title and not just from the beginning of the title, you may want to consider taking one of these approaches:

  1. Use the Analyze API with an analyzer that will tokenize the title into the tokens/terms from which you would want partial prefix matches, and index this collection as the input to the completion field (see the sketch after this list). The Standard analyzer may be a good one to start with.

    Bear in mind that the data structure for the completion suggester is held in memory while in use, so high term cardinality across documents will increase the memory demands of this data structure. Also consider that "scoring" of matching terms is simple, in that it is controlled by the weight applied to each input.

or

  2. Don't use the Completion Suggester here and instead set up the title field as a text datatype with multi-fields that include the different ways that title should be analyzed (or not analyzed, with a keyword sub-field, for example).

    Spend some time with the Analyze API to build an analyzer that will allow for partial prefix matching of terms anywhere in the title. As a start, something like the Standard tokenizer, Lowercase token filter, Edgengram token filter and possibly Stop token filter would get you running. Also note that you would want a Search analyzer that does something similar to the Index analyzer except for the Edgengram token filter, as tokens in the search input would not need to be ngrammed.
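
A rough sketch of option 1, with placeholder index and field names: run the title through the Analyze API, then index the returned tokens as the input of the completion field.

POST myIndex/_analyze
{
  "analyzer": "standard",
  "text": "Blue asics running shoes"
}

// returns the tokens: blue, asics, running, shoes

POST myIndex/_doc
{
  "title": "Blue asics running shoes",
  "suggest": {
    "input": ["blue", "asics", "running", "shoes"]
  }
}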

Upvotes: 2
