Reputation: 15559
I want to build auto-complete functionality for an e-commerce website, using the Completion Suggester.
This is my Index:
PUT myIndex
{
  "mappings": {
    "_doc": {
      "properties": {
        "suggest": {
          "type": "completion"
        },
        "title": {
          "type": "keyword"
        },
        "category": {
          "type": "keyword"
        },
        "description": {
          "type": "keyword"
        }
      }
    }
  }
}
Now, when uploading the advertisement I want the title field to be used for auto complete, so this is how I upload a document:
POST myIndex/_doc
{
  "title": "Blue asics running shoes",
  "category": "sports",
  "description": "Nice blue running shoes, size 44 eu",
  "suggest": {
    "input": "Blue Asics running shoes" // <-- use title
  }
}
Problem is, this way Elasticsearch only matches the string from the beginning, i.e. "Blu" will find a result, but "Asic", "Run" or "Sho" won't return anything.
So what I need to do is tokenize my input like this:
POST myIndex/_doc
{
  "title": "Blue asics running shoes",
  "category": "sports",
  "description": "Nice blue running shoes, size 44 eu",
  "suggest": {
    "input": ["Blue", "Asics", "running", "shoes"] // <-- tokenized title
  }
}
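For completeness, this is the kind of completion query being run against the suggest field (the title_suggest suggester name is just illustrative):

```json
POST myIndex/_search
{
  "suggest": {
    "title_suggest": {
      "prefix": "asic",
      "completion": {
        "field": "suggest"
      }
    }
  }
}
```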
This would work fine... But how am I supposed to tokenize my field? I know I can split the string in C#, but is there any way I can do this in Elasticsearch/NEST?
Upvotes: 3
Views: 2403
Reputation: 15559
Based on Russ Cam's answer above (option 2), this Elasticsearch guide and also this document, I ended up with the following solution:
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_token_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        },
        "additional_stop_words": {
          "type": "stop",
          "stopwords": ["your"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "char_filter": {
        "mapped_words_char_filter": {
          "type": "mapping",
          "mappings": [
            "C# => csharp",
            "c# => csharp"
          ]
        }
      },
      "analyzer": {
        "result_suggester_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [ "html_strip", "mapped_words_char_filter" ],
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "asciifolding",
            "stop",
            "additional_stop_words",
            "english_stemmer",
            "edge_ngram_token_filter",
            "unique"
          ]
        }
      }
    }
  }
}
Query to test this solution:
POST my_index/_analyze
{
  "analyzer": "result_suggester_analyzer",
  "text": "C# & SQL are great languages. K2 is the mountaineer's mountain. Your house-décor is à la Mode"
}
I would get these tokens (n-grams):
cs, csh, csha, cshar, csharp, sq, sql, gr, gre, grea, great, la, lan, lang,
langu, langua, languag, k2, mo, mou, moun, mount, mounta, mountai, mountain,
ho, hou, hous, hous, de, dec, deco, decor, mod, mode
Things to note here:

- The stop filter, which is the default English stop filter, is blocking "are", "is" and "the" - but not "your".
- The additional_stop_words filter, which stops "your".
- The english and possessive_english stemmers, which tokenize the word stems: that's why we have the languag token but not language or languages... also note that we have mountain but not mountaineering.
- The mapped_words_char_filter, which converts C# to csharp; without this, c# would not be a valid token (this setting would not tokenize F#).
- The html_strip char filter, which converts &amp; to &, which is then ignored since our min_gram is 2.
- The asciifolding token filter, which is why décor is tokenized as decor.

This is the NEST code for the above:
var createIndexResponse = ElasticClient.CreateIndex(IndexName, c => c
    .Settings(st => st
        .Analysis(an => an
            .Analyzers(anz => anz
                .Custom("result_suggester_analyzer", cc => cc
                    .Tokenizer("standard")
                    .CharFilters("html_strip", "mapped_words_char_filter")
                    .Filters(new string[] { "english_possessive_stemmer", "lowercase", "asciifolding", "stop", "additional_stop_words", "english_stemmer", "edge_ngram_token_filter", "unique" })
                )
            )
            .CharFilters(cf => cf
                .Mapping("mapped_words_char_filter", md => md
                    .Mappings(
                        "C# => csharp",
                        "c# => csharp"
                    )
                )
            )
            .TokenFilters(tfd => tfd
                .EdgeNGram("edge_ngram_token_filter", engd => engd
                    .MinGram(2)
                    .MaxGram(10)
                )
                .Stop("additional_stop_words", sfd => sfd.StopWords(new string[] { "your" }))
                .Stemmer("english_stemmer", esd => esd.Language("english"))
                .Stemmer("english_possessive_stemmer", epsd => epsd.Language("possessive_english"))
            )
        )
    )
    .Mappings(m => m.Map<AdDocument>(d => d.AutoMap())));
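Note that the analyzer still has to be applied to a field for search to use it. A possible mapping fragment (field names follow the question; standard is used here as a placeholder search analyzer so that query input is not n-grammed):

```json
PUT my_index/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "result_suggester_analyzer",
      "search_analyzer": "standard"
    }
  }
}
```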
Upvotes: 1
Reputation: 125488
The completion suggester is designed for fast search-as-you-type prefix queries, and it uses the simple analyzer by default, not the standard analyzer that is the default for text datatypes.
If you need partial prefix matching on any tokens in the title, and not just from the beginning of the title, you may want to consider taking one of these approaches:

1. Use the Analyze API with an analyzer that will tokenize the title into the tokens/terms from which you would want to partial-prefix match, and index this collection as the input to the completion field. The standard analyzer may be a good one to start with.

   Bear in mind that the data structure for the completion suggester is held in memory whilst in use, so high term cardinality across documents will increase the memory demands of this data structure. Also consider that "scoring" of matching terms is simple, in that it is controlled by the weight applied to each input.
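   A sketch of this first approach, using the question's index and title (names are illustrative): analyze the title up front, then index the returned tokens as the completion field's input array.

   ```json
   POST myIndex/_analyze
   {
     "analyzer": "standard",
     "text": "Blue asics running shoes"
   }
   ```

   The tokens this returns ("blue", "asics", "running", "shoes") become the input array of the suggest field when indexing the document.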
or
2. Don't use the Completion Suggester here, and instead set up the title field as a text datatype with multi-fields that include the different ways that title should be analyzed (or not analyzed, with a keyword sub-field, for example).

   Spend some time with the Analyze API to build an analyzer that will allow partial prefixes of terms anywhere in the title. As a start, something like the standard tokenizer, lowercase token filter, edge_ngram token filter and possibly stop token filter would get you running. Also note that you would want a search analyzer that does something similar to the index analyzer, except without the edge_ngram token filter, as tokens in the search input would not need to be n-grammed.
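   A minimal sketch of this second approach (all analyzer and filter names here are illustrative, not from the question):

   ```json
   PUT my_index
   {
     "settings": {
       "analysis": {
         "filter": {
           "edge_ngram_filter": { "type": "edge_ngram", "min_gram": 2, "max_gram": 10 }
         },
         "analyzer": {
           "title_index_analyzer": {
             "type": "custom",
             "tokenizer": "standard",
             "filter": [ "lowercase", "edge_ngram_filter" ]
           },
           "title_search_analyzer": {
             "type": "custom",
             "tokenizer": "standard",
             "filter": [ "lowercase" ]
           }
         }
       }
     },
     "mappings": {
       "_doc": {
         "properties": {
           "title": {
             "type": "text",
             "analyzer": "title_index_analyzer",
             "search_analyzer": "title_search_analyzer",
             "fields": {
               "keyword": { "type": "keyword" }
             }
           }
         }
       }
     }
   }
   ```

   With this, a plain match query on title would prefix-match any term in the title, while the keyword sub-field keeps the unanalyzed value for sorting or exact matching.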
Upvotes: 2