Reputation: 1
I'm using a nearly current Elasticsearch release (8.13) and am trying to build an analyzer with a multiplexer that uses a synonym filter. However, I found that the results from simple filter chaining (without the multiplexer) differ from those of a multiplexer with the same token filters.
I made an artificial example with two analyzers:
PUT /test-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_analyzer": {
            "tokenizer": "classic",
            "filter": [
              "test_stemmer",
              "test_synonym_filter"
            ]
          },
          "test_analyzer_multiplexer": {
            "tokenizer": "classic",
            "filter": [
              "multiplexer_custom"
            ]
          }
        },
        "filter": {
          "multiplexer_custom": {
            "type": "multiplexer",
            "filters": [
              "test_stemmer, test_synonym_filter"
            ],
            "preserve_original": true
          },
          "test_synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "walking, jumping fox"
            ]
          },
          "test_stemmer": {
            "type": "stemmer",
            "language": "english"
          }
        }
      }
    }
  }
}
They are basically identical and should (I assume) output the same results. But when I test them, I get different tokens. With simple filter chaining everything is OK:
GET test-index/_analyze
{
  "analyzer": "test_analyzer",
  "text": "jumping fox"
}
Result:
{
  "tokens": [
    {
      "token": "walk",
      "start_offset": 0,
      "end_offset": 11,
      "type": "SYNONYM",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
I get my stemmed synonym "walk", as expected. But if I test the analyzer with the multiplexer:
GET test-index/_analyze
{
  "analyzer": "test_analyzer_multiplexer",
  "text": "jumping fox"
}
Result is:
{
  "tokens": [
    {
      "token": "jumping",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "jump",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "fox",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
No synonym at all. I believe this happens because the multiplexer preserves the original tokens by default, but where is the synonym? If I add preserve_original: false, I am getting right result, but what if I need to keep the original tokens AND get synonyms while using the multiplexer?
Either it's a kind of bug or I don't fully understand how it should work.
P.S. I saw an almost identical topic, https://discuss.elastic.co/t/synonym-filter-not-working-within-a-multiplexer-filter/271719, but I believe my case is different: my synonyms don't overlap with each other, so Lucene's RemoveDuplicatesTokenFilter should (I think) work correctly. Maybe something is going wrong somewhere else?
UPD
My overall task is to build an analyzer that would ideally emit: a) the original tokens; b) the stemmed original tokens; AND c) the stemmed synonyms. For now I can get either "a" and "b" (with the keyword_repeat filter) or "b" and "c" (because keyword_repeat can't be used with synonyms). That led me to the multiplexer, which, as I understood from the docs, can build two (or more) independent token chains and then merge them. So I wanted to use one chain with keyword_repeat to preserve the original tokens (although that doesn't seem necessary, because the multiplexer already does this) and a second chain to get the stemmed synonyms. But in my experiments with the multiplexer I ran into the problem described above in my original post: it looks like it doesn't work well with synonyms in general. But maybe I just chose the wrong solution for my task in the first place.
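For reference, the keyword_repeat arrangement for "a" and "b" could look roughly like the sketch below. It is a minimal illustration only: the index name test-index-keyword-repeat and the analyzer name test_analyzer_keyword_repeat are hypothetical, and the built-in remove_duplicates filter is added on the assumption that tokens the stemmer leaves unchanged should not be emitted twice.

```
# Hypothetical index and analyzer names, for illustration only
PUT /test-index-keyword-repeat
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_analyzer_keyword_repeat": {
            "tokenizer": "classic",
            "filter": [
              "keyword_repeat",
              "test_stemmer",
              "remove_duplicates"
            ]
          }
        },
        "filter": {
          "test_stemmer": {
            "type": "stemmer",
            "language": "english"
          }
        }
      }
    }
  }
}
```

Here keyword_repeat emits each token twice, once marked as a keyword that the stemmer skips, so both the original and the stemmed form end up at the same position.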
Upvotes: 0
Views: 150
Reputation: 30163
If I add preserve_original: false, I am getting right result
Are you sure? I don't see much difference between having preserve_original set to true or false.
Either it's a kind of bug or I don't fully understand how it should work.
It's not a bug; it's a documented technical limitation of the multiplexer token filter. If you refer to the second warning on the multiplexer token filter documentation page, you will find the following statement (emphasis added):
Shingle or multi-word synonym token filters will not function normally when they are declared in the filters array because they read ahead internally which is unsupported by the multiplexer.
Multi-word synonyms simply cannot function correctly in this setup. I am not sure how the preserve_original setting would help in this case.
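Since the warning applies to synonym filters declared inside the filters array, one arrangement that might be worth trying is to run the synonym filter before the multiplexer and leave only the stemmer inside it. The following is an untested sketch, not a confirmed solution: the index name test-index-synonym-first, the analyzer name test_analyzer_synonym_first, and the filter name multiplexer_stem_only are hypothetical. Note that the synonym rule would then be matched against unstemmed tokens, so it would only fire on the literal phrase "jumping fox" and not on other inflections.

```
# Hypothetical names; untested sketch that moves the synonym filter out of the multiplexer
PUT /test-index-synonym-first
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "test_analyzer_synonym_first": {
            "tokenizer": "classic",
            "filter": [
              "test_synonym_filter",
              "multiplexer_stem_only"
            ]
          }
        },
        "filter": {
          "multiplexer_stem_only": {
            "type": "multiplexer",
            "filters": [
              "test_stemmer"
            ],
            "preserve_original": true
          },
          "test_synonym_filter": {
            "type": "synonym_graph",
            "synonyms": [
              "walking, jumping fox"
            ]
          },
          "test_stemmer": {
            "type": "stemmer",
            "language": "english"
          }
        }
      }
    }
  }
}
```

Running _analyze against this analyzer with the text "jumping fox" should show whether the original tokens, their stems, and the stemmed synonym all come through. The unstemmed synonym would likely appear as well, since the multiplexer preserves whatever tokens reach it.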
Upvotes: 0