Reputation: 5291
I have been trying to generate trigrams with Elasticsearch tokenizers. I followed the tutorials at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html and http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
Following these docs, testing the analyzer with
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
produces character n-grams like: FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
What I want instead is whole-word trigrams.
For example, the trigrams for "the quick red fox jumps over the lazy brown dog"
would be:
the quick red
quick red fox
red fox jumps
fox jumps over
jumps over the
over the lazy
the lazy brown
lazy brown dog
In a nutshell, how can I create word trigrams like the ones above using Elasticsearch?
Upvotes: 2
Views: 3678
Reputation: 5291
Found it. The answer lies in the shingle filter. These index settings made it work:
{
  "settings": {
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 3,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
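The key attributes here are "type": "shingle" and the min/max shingle sizes, both set to 3 so that only three-word shingles are produced. Setting "output_unigrams" to false keeps the single words themselves out of the token stream.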
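To make it clearer what tokens to expect from nGram_analyzer, here is a rough Python sketch that imitates the filter chain outside of Elasticsearch: split on whitespace, lowercase, then join every run of three consecutive words. The function name word_shingles is made up for this illustration, and the asciifolding step is left out, so treat it as an approximation of the analyzer's output rather than what Elasticsearch runs internally.

```python
def word_shingles(text, size=3):
    """Approximate whitespace tokenizer -> lowercase -> shingle filter
    with min_shingle_size == max_shingle_size == size and no unigrams.
    (Hypothetical helper; asciifolding is not emulated.)"""
    tokens = text.lower().split()
    # Slide a window of `size` words over the token list.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


if __name__ == "__main__":
    for shingle in word_shingles("the quick red fox jumps over the lazy brown dog"):
        print(shingle)
```

For the sentence from the question this prints the eight trigrams listed there, from "the quick red" through "lazy brown dog". Input with fewer than three words gives an empty list in this sketch.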
Upvotes: 3