Philipp

Reputation: 4270

Elasticsearch - nGram filter preserve/keep original token

I am applying an ngram-filter to my string field:

"custom_ngram": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 10
}

But as a result I lose tokens that are shorter or longer than the ngram range.

Original tokens such as "iq" or "a4", for example, cannot be found.

I am already applying some language-specific analysis before the ngram filter, so I would like to avoid copying the whole field. I am looking to expand the tokens with ngrams while keeping the originals.

Any ideas or ngram-suggestions?

Here is an example of one of my analyzers that uses the custom_ngram filter:

"french": {
    "type":"custom",
    "tokenizer": "standard",
    "filter": [
        "french_elision",
        "lowercase",
        "french_stop",
        "custom_ascii_folding",
        "french_stemmer",
        "custom_ngram"
    ]
}

Upvotes: 1

Views: 1164

Answers (3)

varunbachalli

Reputation: 1

I'm not sure whether this option existed before, but the solution now is to set preserve_original on the filter:

"custom_ngram": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 10,
    "preserve_original" : true
}
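
For context, here is a minimal sketch of how this filter might sit inside full index settings. Two things in it are assumptions rather than part of the answer above: the analyzer name `text_with_ngrams` is made up for illustration, and `index.max_ngram_diff` is raised because recent Elasticsearch versions limit the gap between `min_gram` and `max_gram` to 1 by default, which a 3 to 10 range exceeds.

```json
{
  "settings": {
    "index": {
      "max_ngram_diff": 7
    },
    "analysis": {
      "filter": {
        "custom_ngram": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "preserve_original": true
        }
      },
      "analyzer": {
        "text_with_ngrams": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_ngram"]
        }
      }
    }
  }
}
```

With `preserve_original` enabled, a token shorter than `min_gram` (such as "iq") or longer than `max_gram` should be emitted unchanged alongside the generated ngrams.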

Upvotes: 0

Philipp

Reputation: 4270

As Andrei Stefan pointed out, I had to go with multi_fields.

I did and my mapping (for french) now looks like this:

                "french_strings": {
                    "match": "*_fr",
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "string",
                        "analyzer": "french",
                        "fields":{
                            "ngram":{
                                "type":"string",
                                "index":"analyzed",
                                "analyzer":"ngram",
                                "search_analyzer": "default_search"
                            }
                        }
                    }
                }

I decided to remove the ngram filter from the french analyzer and use a custom "ngram-only" analyzer for the .ngram subfield. This results in a French-analyzed field plus an "original-to-ngram" subfield.
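
The `ngram` and `default_search` analyzers referenced in the mapping are not shown in this answer. A rough sketch of what such definitions might look like follows; the tokenizer and filter choices are assumptions for illustration, not the actual configuration used here.

```json
{
  "analysis": {
    "filter": {
      "custom_ngram": {
        "type": "ngram",
        "min_gram": 3,
        "max_gram": 10
      }
    },
    "analyzer": {
      "ngram": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "custom_ngram"]
      },
      "default_search": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase"]
      }
    }
  }
}
```

Keeping the ngram filter out of the search analyzer means a query term is matched whole against the indexed ngrams instead of being split into ngrams itself.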

Upvotes: 0

Andrei Stefan

Reputation: 52368

You have no option other than to use multi-fields and index that field a second time with a different analyzer that keeps the shorter terms as well. Something like this:

    "text": {
      "type": "string",
      "analyzer": "french",
      "fields": {
        "standard_version": {
          "type": "string",
          "analyzer": "standard"
        }
      }
    }

Then adjust your queries to also search the text.standard_version field.
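
As a sketch of such a query, a `multi_match` can search the main field and the subfield together. The search term "iq" is taken from the question; everything else about the query shape is an assumption about what fits your use case.

```json
{
  "query": {
    "multi_match": {
      "query": "iq",
      "fields": ["text", "text.standard_version"]
    }
  }
}
```

Each field analyzes the query text with its own analyzer, so a short term that the ngram filter drops can still match through the subfield.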

Upvotes: 1
