user5301398

Reputation: 25

Unexpected Removal of Periods (.) in Tokens with Custom Elasticsearch pipeAnalyzer

I have configured a custom analyzer in Elasticsearch called pipeAnalyzer that is intended to tokenize strings using the pipe (|) character as a delimiter, while also applying lowercase conversion, ASCII folding, and trimming of leading and trailing whitespace from tokens. However, I'm seeing unexpected behavior: the analyzer also removes periods (.) from the tokens, which was not intended.

Here is the configuration of my pipeAnalyzer:

{
  "pipeAnalyzer": {
    "type": "custom",
    "tokenizer": "pattern",
    "pattern": "\\|",
    "filter": ["trim", "lowercase", "asciifolding"]
  }
}

For example, when analyzing the string 75207149.0.1_sb.research.com, I expected the output to retain the periods as part of the tokens. Instead, the analyzer splits the string in such a way that periods are removed, producing tokens like 75207149, 0, 1_sb, research, and com, instead of treating the entire string as a single token or at least preserving the periods within tokens.
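
For reference, this is the call I am using to inspect the tokens (my-index is a placeholder for the index where the analyzer is registered):

POST /my-index/_analyze
{
  "analyzer": "pipeAnalyzer",
  "text": "75207149.0.1_sb.research.com"
}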

This behavior is puzzling because the configured pattern tokenizer uses a pipe (|) as the delimiter, and there's no indication that periods should be removed or treated as delimiters.

Questions:

1. Why does my pipeAnalyzer remove periods from the tokens, despite the tokenizer being configured to use the pipe character as the delimiter?
2. How can I adjust my pipeAnalyzer configuration so that periods are not removed from tokens during analysis?

Any insights or suggestions on how to address this issue would be greatly appreciated. I'm using Elasticsearch version 7.17.

Upvotes: 0

Views: 52

Answers (1)

Sajad Soltanian

Reputation: 121

Solution

I tested your analyzer settings and can confirm that, as you described, the dots are removed, which is a bit odd at first glance. In a second test I moved the tokenizer definition out of the analyzer body into its own section, and with that change the analyzer works correctly.

Try the settings below:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "pipeAnalyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "filter": ["trim", "lowercase", "asciifolding"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\\|"
        }
      }
    }
  }
}
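
With these settings in place, you can verify the behavior with the _analyze API (pipe-test is just a placeholder name for an index created with the settings above):

POST /pipe-test/_analyze
{
  "analyzer": "pipeAnalyzer",
  "text": "75207149.0.1_sb.research.com|Another.Example"
}

This should return exactly two tokens, 75207149.0.1_sb.research.com and another.example, with the periods preserved and only the pipe acting as a delimiter.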

I was wondering why this happens. Based on the official Elasticsearch documentation on creating a custom analyzer, a tokenizer with non-default settings has to be defined separately under analysis.tokenizer and then referenced by name inside the analyzer. In your original settings, "tokenizer": "pattern" selects the built-in pattern tokenizer, and the extra pattern key placed directly in the analyzer body is apparently just ignored. The built-in pattern tokenizer's default pattern is \W+, which splits on every non-word character, including periods. That matches the tokens you saw: the string is split at each dot, while 1_sb stays together because the underscore is a word character.
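
You can see the cause directly by running the built-in pattern tokenizer on its own, without any index or custom settings:

POST /_analyze
{
  "tokenizer": "pattern",
  "text": "75207149.0.1_sb.research.com"
}

This returns 75207149, 0, 1_sb, research and com, the same tokens you observed, because the default \W+ pattern splits on the periods but keeps the underscore.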

Upvotes: 1
