eric

Reputation: 2759

In Elasticsearch, why do I lose the whole word token when I run a word through an ngram filter?

It seems that if I am running a word or phrase through an ngram filter, the original word does not get indexed. Instead, I only get chunks of the word up to my max_gram value. I would expect the original word to get indexed as well. I'm using Elasticsearch 0.20.5. If I set up an index using a filter with ngrams like so:

curl -XPUT 'http://localhost:9200/test/' -d '{
    "settings": {
        "analysis": {
            "filter": {
                "my_ngram": {
                    "max_gram": 10,
                    "min_gram": 1,
                    "type": "nGram"
                },
                "my_stemmer": {
                    "type": "stemmer",
                    "name": "english"
                }
            },
            "analyzer": {
                "default_index": {
                    "filter": [
                        "standard",
                        "lowercase",
                        "asciifolding",
                        "my_ngram",
                        "my_stemmer"
                    ],
                    "type": "custom",
                    "tokenizer": "standard"
                },
                "default_search": {
                    "filter": [
                        "standard",
                        "lowercase"
                    ],
                    "type": "custom",
                    "tokenizer": "standard"
                }
            }
        }
    }
}'

Then I put a long word into a document:

curl -XPUT 'http://localhost:9200/test/item/1' -d '{
    "foo": "REALLY_REALLY_LONG_WORD"
}'

And I query for that long word:

curl -XGET 'http://localhost:9200/test/item/_search' -d '{
    "query": {
        "match": {
            "foo": "REALLY_REALLY_LONG_WORD"
        }
    }
}'

I get 0 results. I do get a result if I query for a 10 character chunk of that word. When I run this:

curl -XGET 'localhost:9200/test/_analyze?text=REALLY_REALLY_LONG_WORD'

I get tons of grams back, but not the original word. Am I missing a configuration to make this work the way I want?

Upvotes: 1

Views: 999

Answers (1)

runarM

Reputation: 1621

If you would like to keep the complete word or phrase, use a multi-field mapping for the value, keeping one sub-field "not_analyzed" (or analyzed with the keyword tokenizer) instead.
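On 0.20.x that could look something like the following (a sketch only; the "untouched" sub-field name is just an example, and the mapping has to be in place before the documents are indexed):

# Keep an untouched copy of "foo" next to the nGram-analyzed one.
curl -XPUT 'http://localhost:9200/test/item/_mapping' -d '{
    "item": {
        "properties": {
            "foo": {
                "type": "multi_field",
                "fields": {
                    "foo": { "type": "string", "index": "analyzed" },
                    "untouched": { "type": "string", "index": "not_analyzed" }
                }
            }
        }
    }
}'

The whole word can then be matched exactly against the sub-field, for example with a term query:

# Exact match against the not_analyzed sub-field (case must match, since it is not lowercased).
curl -XGET 'http://localhost:9200/test/item/_search' -d '{
    "query": {
        "term": { "foo.untouched": "REALLY_REALLY_LONG_WORD" }
    }
}'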

Also, when searching a field with nGram-tokenized values, you should probably run the search terms through the same nGram analysis; then the n-character limit also applies to the search phrase, and you will get the expected results.
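For example, keeping the question's setup but putting the same filter into the search analyzer (a sketch only, trimmed to the relevant parts; asciifolding and the stemmer are omitted for brevity):

# Recreate the index so that queries are run through the same nGram filter
# as the indexed values.
curl -XPUT 'http://localhost:9200/test/' -d '{
    "settings": {
        "analysis": {
            "filter": {
                "my_ngram": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 10
                }
            },
            "analyzer": {
                "default_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase", "my_ngram"]
                },
                "default_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["standard", "lowercase", "my_ngram"]
                }
            }
        }
    }
}'

With that in place, a match query for REALLY_REALLY_LONG_WORD is itself broken into grams of at most 10 characters, so it can find the grams that were indexed.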

Upvotes: 3
