Developer

Reputation: 457

elasticsearch edge n-gram tokenizer: include symbols in the tokens

I am using a custom tokenizer based on the Edge NGram tokenizer, and I would like to be able to search for strings like "sport+", i.e., I would like special symbols such as the + sign to be treated as part of the token.

For example, we have documents with the following fields:

"typeName": "LC 500h Sport+ CVT" or "typeName": "LC 500h Sport CVT".

Executing a query with the following clause:

{
  "match": {
    "typeName": {
      "query": "sport+ cvt",
      "operator": "and"
    }
  }
}

fetches both documents. However, we would like only the document with "typeName": "LC 500h Sport+ CVT" to be returned in this case.

We have been using the following token_chars classes in the tokenizer settings: digit, letter, punctuation. I thought that adding symbol as a token_chars class and recreating the index would do the trick, but it has not helped.

EDIT: Here is the analyzer definition in Nest syntax:

Settings(s => s
    .Analysis(_ => _
        .Analyzers(a => a
            .Custom(
                "vehicleanalyzer",
                descriptor => descriptor
                    .Tokenizer(vehicleEdgeNgram)
                    .Filters("lowercase"))
            .Standard(
                "vehiclesearch",
                descriptor => descriptor))
        .Tokenizers(descriptor => descriptor
            .EdgeNGram(
                vehicleEdgeNgram,
                tokenizerDescriptor => tokenizerDescriptor
                    .MinGram(1)
                    .MaxGram(10)
                    .TokenChars(
                        TokenChar.Digit,
                        TokenChar.Letter,
                        TokenChar.Punctuation,
                        TokenChar.Symbol)))))
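
To verify what this analyzer actually emits for a given input, the _analyze API can be queried directly. This is just a sketch: the index name vehicles is an assumption, while the analyzer name matches the definition above.

GET /vehicles/_analyze
{
  "analyzer": "vehicleanalyzer",
  "text": "Sport+"
}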

Upvotes: 1

Views: 1438

Answers (1)

Lupanoide

Reputation: 3212

As written in the documentation, token_chars is defined as:

Character classes that should be included in a token. Elasticsearch will split on characters that don’t belong to the classes specified. Defaults to [] (keep all characters).

By default, Elasticsearch keeps all characters. You should use this option only if you want fewer character classes in your inverted index. So to resolve your problem, simply remove the token_chars definition: your tokenizer will then keep all characters, including symbols such as +.
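
For example, the settings would reduce to something like the following. This is a raw JSON sketch rather than the exact NEST output: the tokenizer name vehicle_edge_ngram and the index name vehicles are illustrative, while the gram sizes and the lowercase filter are copied from the question.

PUT /vehicles
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "vehicle_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "vehicleanalyzer": {
          "type": "custom",
          "tokenizer": "vehicle_edge_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

In the NEST definition from the question, this corresponds to dropping the .TokenChars(...) call from the EdgeNGram tokenizer descriptor.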

Upvotes: 1
