Kozhevnikov Nikita

Reputation: 1

Elasticsearch multiplexer token filter with synonyms doesn’t work as expected

I'm using Elasticsearch 8.13, close to the latest release, and I'm trying to build an analyzer with a multiplexer that uses a synonym filter. However, I found that the results from simple filter chaining (without a multiplexer) differ from those of a multiplexer with the same token filters.

I made an artificial example with two analyzers:

PUT /test-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_analyzer": {
            "tokenizer": "classic",
            "filter": [
              "test_stemmer",
              "test_synonym_filter"
            ]
          },
          "test_analyzer_multiplexer": {
            "tokenizer": "classic",
            "filter": [
              "multiplexer_custom"
            ]
          }
        },
        "filter": {
          "multiplexer_custom": {
            "type": "multiplexer",
            "filters": [
              "test_stemmer, test_synonym_filter"
            ],
            "preserve_original": true
          },
          "test_synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "walking, jumping fox"
            ]
          },
          "test_stemmer": {
            "type": "stemmer",
            "language": "english"
          }
        }
      }
    }
  }
}

They are basically identical and should (I assume) output the same results. But when I test them, I get different tokens. With simple filter chaining everything is OK:

GET test-index/_analyze
{
  "analyzer": "test_analyzer",
  "text": "jumping fox"
}

Result:

{
  "tokens": [
    {
      "token": "walk",
      "start_offset": 0,
      "end_offset": 11,
      "type": "SYNONYM",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

I get my stemmed synonym "walk", as expected. But if I test the analyzer with the multiplexer:

GET test-index/_analyze
{
  "analyzer": "test_analyzer_multiplexer",
  "text": "jumping fox"
}

Result:

{
  "tokens": [
    {
      "token": "jumping",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

There is no synonym at all. I believe this happens because the multiplexer preserves original tokens by default, but even so, where is the synonym? If I add preserve_original: false, I am getting right result, but what if I need to keep the original tokens AND get synonyms while using the multiplexer?

Either it's a kind of bug or I don't fully understand how it should work.

P.S. I saw an almost identical topic, https://discuss.elastic.co/t/synonym-filter-not-working-within-a-multiplexer-filter/271719, but I believe my case is different: my synonyms don't intersect with each other, so RemoveDuplicatesTokenFilter from Lucene should (I think) work correctly. Maybe something is going wrong somewhere else?

UPD

My overall task is to build an analyzer that would ideally emit: a) original tokens; b) stemmed original tokens; and c) stemmed synonyms. For now I can get either "a" and "b" (with the keyword_repeat filter) or "b" and "c" (because keyword_repeat can't be used with synonyms). That led me to the multiplexer, which, as I understood from the docs, can build two (or more) independent token chains and then merge them. So I wanted to use one chain with keyword_repeat to preserve the original tokens (although that doesn't seem necessary, because the multiplexer already behaves that way) and a second chain to get stemmed synonyms. But in my experiments with the multiplexer I ran into the problem described above in my original post: it looks like it doesn't work well with synonyms in general. But maybe I just chose the wrong solution for my task in the first place.

Upvotes: 0

Views: 150

Answers (1)

imotov

Reputation: 30163

If I add preserve_original: false, I am getting right result

Are you sure? I don't see much difference between having preserve_original set to true or false.

Either it's a kind of bug or I don't fully understand how it should work.

It's not a bug; it's a documented technical limitation of the multiplexer token filter. If you refer to the second warning on the multiplexer token filter documentation page, you will find the following statement (emphasis added):

Shingle or multi-word synonym token filters will not function normally when they are declared in the filters array because they read ahead internally which is unsupported by the multiplexer.

Multi-word synonyms simply cannot function correctly in this setup. I am not sure how the preserve_original setting would help in this case.
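
One possible way around this limitation, sketched here as an untested assumption rather than a confirmed fix, is to move the synonym filter out of the multiplexer's filters array and run it before the multiplexer, leaving only the stemmer inside. The index name test-index-synonyms-first, the analyzer name test_analyzer_synonyms_first and the filter name stem_multiplexer are hypothetical; the synonym and stemmer definitions are copied from the question.

```
# Hypothetical index, analyzer and multiplexer names; untested sketch.
# The synonym filter runs first, outside the multiplexer, so its read-ahead
# happens on the normal token stream. The multiplexer then only stems.
PUT /test-index-synonyms-first
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_analyzer_synonyms_first": {
            "tokenizer": "classic",
            "filter": [
              "test_synonym_filter",
              "stem_multiplexer"
            ]
          }
        },
        "filter": {
          "test_synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "walking, jumping fox"
            ]
          },
          "stem_multiplexer": {
            "type": "multiplexer",
            "filters": [
              "test_stemmer"
            ],
            "preserve_original": true
          },
          "test_stemmer": {
            "type": "stemmer",
            "language": "english"
          }
        }
      }
    }
  }
}

# Inspect the tokens this ordering actually produces.
GET test-index-synonyms-first/_analyze
{
  "analyzer": "test_analyzer_synonyms_first",
  "text": "jumping fox"
}
```

With this ordering the synonym rules are matched against unstemmed tokens, so they have to be written in the forms that actually occur in the text; the stemmed forms of the synonyms would then come from the multiplexer. Whether the output really contains the original tokens, the stemmed tokens and the stemmed synonyms should be checked with the _analyze request.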

Upvotes: 0
