Developer

Reputation: 457

elasticsearch edge n-gram tokenizer: include symbols in the tokens

I am using a custom tokenizer based on the Edge NGram tokenizer, and I would like to be able to search for strings like "sport+", i.e., I would like special symbols such as the + sign to be treated as part of the token.

For example, we have documents with the following fields:

"typeName": "LC 500h Sport+ CVT" or "typeName": "LC 500h Sport CVT".

Executing a query with the following clause:

{
  "match": {
    "typeName": {
      "query": "sport+ cvt",
      "operator": "and"
    }
  }
}

fetches both documents. However, we would like only the document with "typeName": "LC 500h Sport+ CVT" to be returned in this case.

We have been using the following token_chars classes in the tokenizer settings: digit, letter, punctuation. I thought that adding symbol as a token_chars class and recreating the index would do the trick, but it has not helped.

EDIT: Here is the analyzer definition in Nest syntax:

Settings(s => s
    .Analysis(_ => _
        .Analyzers(a => a
            .Custom(
                "vehicleanalyzer",
                descriptor => descriptor
                    .Tokenizer(vehicleEdgeNgram)
                    .Filters("lowercase"))
            .Standard(
                "vehiclesearch",
                descriptor => descriptor))
        .Tokenizers(descriptor => descriptor
            .EdgeNGram(
                vehicleEdgeNgram,
                tokenizerDescriptor => tokenizerDescriptor
                    .MinGram(1)
                    .MaxGram(10)
                    .TokenChars(
                        TokenChar.Digit,
                        TokenChar.Letter,
                        TokenChar.Punctuation,
                        TokenChar.Symbol)))))
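
To verify what this analyzer actually emits for a given input, the _analyze API can be queried directly. This is just a sketch: the index name vehicles is an assumption, while the analyzer name matches the definition above.

GET /vehicles/_analyze
{
  "analyzer": "vehicleanalyzer",
  "text": "Sport+"
}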

Upvotes: 1

Views: 1438

Answers (1)

Lupanoide

Reputation: 3212

As written in the documentation, token_chars is defined as:

Character classes that should be included in a token. Elasticsearch will split on characters that don’t belong to the classes specified. Defaults to [] (keep all characters).

By default, Elasticsearch keeps all characters. You should use this option only if you want fewer character classes in your inverted index. So to resolve your problem, simply remove the token_chars definition: your tokenizer will then keep all characters, including symbols such as +.
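
For example, the settings would reduce to something like the following. This is a raw JSON sketch rather than the exact NEST output: the tokenizer name vehicle_edge_ngram and the index name vehicles are illustrative, while the gram sizes and the lowercase filter are copied from the question.

PUT /vehicles
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "vehicle_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "vehicleanalyzer": {
          "type": "custom",
          "tokenizer": "vehicle_edge_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

In the NEST definition from the question, this corresponds to dropping the .TokenChars(...) call from the EdgeNGram tokenizer descriptor.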

Upvotes: 1
