user5301398

Reputation: 25

Unexpected Removal of Periods (.) in Tokens with Custom Elasticsearch pipeAnalyzer

I have configured a custom analyzer in Elasticsearch called pipeAnalyzer that is intended to tokenize strings using the pipe (|) character as a delimiter, while also applying lowercase conversion, ASCII folding, and trimming of leading and trailing whitespace from tokens. However, I'm seeing unexpected behavior: the analyzer also removes periods (.) from the tokens, which was not intended.

Here is the configuration of my pipeAnalyzer:

{
  "pipeAnalyzer": {
    "type": "custom",
    "tokenizer": "pattern",
    "pattern": "\\|",
    "filter": ["trim", "lowercase", "asciifolding"]
  }
}

For example, when analyzing the string 75207149.0.1_sb.research.com, I expected the output to retain the periods as part of the tokens. Instead, the analyzer splits the string in such a way that periods are removed, producing tokens like 75207149, 0, 1_sb, research, and com, instead of treating the entire string as a single token or at least preserving the periods within tokens.
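
For reference, this is the call I am using to inspect the tokens (my-index is a placeholder for the index where the analyzer is registered):

POST /my-index/_analyze
{
  "analyzer": "pipeAnalyzer",
  "text": "75207149.0.1_sb.research.com"
}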

This behavior is puzzling because the configured pattern tokenizer uses a pipe (|) as the delimiter, and there's no indication that periods should be removed or treated as delimiters.

Questions:

1. Why does my pipeAnalyzer remove periods from the tokens, despite the tokenizer being configured to use the pipe character as the delimiter?
2. How can I adjust my pipeAnalyzer configuration so that periods are not removed from tokens during analysis?

Any insights or suggestions on how to address this issue would be greatly appreciated. I'm using Elasticsearch version 7.17.

Upvotes: 0

Views: 52

Answers (1)

Sajad Soltanian

Reputation: 121

Solution

I tested your analyzer settings and can confirm that, as you described, the dots are removed, which is a bit odd at first glance. In a second test I moved the tokenizer definition out of the analyzer body into its own section, and with that change the analyzer works correctly.

Try the settings below:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "pipeAnalyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer",
          "filter": ["trim", "lowercase", "asciifolding"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\\|"
        }
      }
    }
  }
}
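
With these settings in place, you can verify the behavior with the _analyze API (pipe-test is just a placeholder name for an index created with the settings above):

POST /pipe-test/_analyze
{
  "analyzer": "pipeAnalyzer",
  "text": "75207149.0.1_sb.research.com|Another.Example"
}

This should return exactly two tokens, 75207149.0.1_sb.research.com and another.example, with the periods preserved and only the pipe acting as a delimiter.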

I was wondering why this happens. Based on the official Elasticsearch documentation on creating a custom analyzer, a tokenizer with non-default settings has to be defined separately under analysis.tokenizer and then referenced by name inside the analyzer. In your original settings, "tokenizer": "pattern" selects the built-in pattern tokenizer, and the extra pattern key placed directly in the analyzer body is apparently just ignored. The built-in pattern tokenizer's default pattern is \W+, which splits on every non-word character, including periods. That matches the tokens you saw: the string is split at each dot, while 1_sb stays together because the underscore is a word character.
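
You can see the cause directly by running the built-in pattern tokenizer on its own, without any index or custom settings:

POST /_analyze
{
  "tokenizer": "pattern",
  "text": "75207149.0.1_sb.research.com"
}

This returns 75207149, 0, 1_sb, research and com, the same tokens you observed, because the default \W+ pattern splits on the periods but keeps the underscore.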

Upvotes: 1
