Reputation: 2759
It seems that if I run a word or phrase through an ngram filter, the original word does not get indexed. Instead, I only get chunks of the word up to my max_gram value. I would expect the original word to be indexed as well. I'm using Elasticsearch 0.20.5. If I set up an index using an ngram filter like so:
curl -XPUT 'http://localhost:9200/test/' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram": {
          "max_gram": 10,
          "min_gram": 1,
          "type": "nGram"
        },
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "default_index": {
          "filter": [
            "standard",
            "lowercase",
            "asciifolding",
            "my_ngram",
            "my_stemmer"
          ],
          "type": "custom",
          "tokenizer": "standard"
        },
        "default_search": {
          "filter": [
            "standard",
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}'
Then I put a long word into a document:
curl -XPUT 'http://localhost:9200/test/item/1' -d '{
  "foo" : "REALLY_REALLY_LONG_WORD"
}'
And I query for that long word:
curl -XGET 'http://localhost:9200/test/item/_search' -d '{
  "query": {
    "match" : {
      "foo" : "REALLY_REALLY_LONG_WORD"
    }
  }
}'
I get 0 results. I do get a result if I query for a 10-character chunk of that word. When I run this:
curl -XGET 'localhost:9200/test/_analyze?text=REALLY_REALLY_LONG_WORD'
I get tons of grams back, but not the original word. Am I missing a configuration to make this work the way I want?
Upvotes: 1
Views: 999
Reputation: 1621
If you would like to keep the complete word or phrase, use a multi-field mapping for the value and keep one sub-field either "not analyzed" or analyzed with the keyword tokenizer.
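For example, with the multi_field mapping type available in 0.20.x, something along these lines could be set up before indexing documents (a sketch only; the sub-field name original is illustrative, not required):

curl -XPUT 'http://localhost:9200/test/item/_mapping' -d '{
  "item": {
    "properties": {
      "foo": {
        "type": "multi_field",
        "fields": {
          "foo":      { "type": "string" },
          "original": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}'

The main foo sub-field keeps the ngram analysis from your settings, while a query against foo.original can match the complete, untouched value.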
Also, when searching a field with nGram-tokenized values, you should probably apply the same nGram tokenization to the search analyzer; the n-character limit will then also apply to the search phrase, and you will get the expected results.
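One way to do that with the settings from the question (a sketch, not the only option) is to give default_search the same filter chain as default_index, so the query text is broken into the same grams at search time:

"default_search": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": [
    "standard",
    "lowercase",
    "asciifolding",
    "my_ngram",
    "my_stemmer"
  ]
}

With that in place, the match query for the full word produces grams that line up with what was indexed.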
Upvotes: 3