Reputation: 71
I used the hyphenation_decompounder filter for German and followed the example in the documentation. So far so good, it works: the text kaffeetasse is tokenized into kaffee and tasse.
The problem arises when I use a multi_match query for kaffeetasse to find documents in which both kaffee AND tasse match. It seems that multi_match combines the tokens generated by the hyphenation_decompounder filter with OR instead of the operator "and" given in the query. Here is my test case:
Mapping
curl -XPUT "http://localhost:9200/testidx" -H 'Content-Type: application/json' -d'{ "settings": { "index": { "analysis": { "analyzer": { "index": { "type" : "custom", "tokenizer": "whitespace", "filter": [ "lowercase" ] }, "search": { "type" : "custom", "tokenizer": "whitespace", "filter": [ "lowercase", "hyph" ] } }, "filter": { "hyph": { "type": "hyphenation_decompounder", "hyphenation_patterns_path": "analysis/de_DR.xml", "word_list": ["kaffee", "zucker", "tasse"], "only_longest_match": true, "min_subword_size": 4 } } } } }, "mappings" : { "properties" : { "title" : { "type" : "text", "analyzer": "index", "search_analyzer": "search" }, "description" : { "type" : "text", "analyzer": "index", "search_analyzer": "search" } } } }'
Document id=1
curl -XPOST "http://localhost:9200/testidx/_doc/1" -H 'Content-Type: application/json' -d'{ "title" : "Kaffee", "description": "Milch Kaffee tasse"}'
Document id=2
curl -XPOST "http://localhost:9200/testidx/_doc/2" -H 'Content-Type: application/json' -d'{ "title" : "Kaffee", "description": "Latte Kaffee Becher"}'
Multi-Match query
curl -XGET "http://localhost:9200/testidx/_search" -H 'Content-Type: application/json' -d'{ "query": { "multi_match": { "query": "kaffeetasse", "fields": ["title", "description"], "operator": "and", "type": "cross_fields", "analyzer": "search" } }}'
My expectation is that Elasticsearch should return only the document with id=1, since it has both kaffee AND tasse in its fields, but it returns both documents, since each of them contains kaffee OR tasse.
Elasticsearch: 7.9.2
de_DR.xml was downloaded from https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download as mentioned in the documentation.
Upvotes: 2
Views: 898
Reputation: 1123
Elasticsearch returns both documents because it applies the operator parameter to the original query kaffeetasse, not to the tokens kaffee and tasse produced by the analyzer. This behavior is described in the documentation for the match query:
operator (Optional, string) Boolean logic used to interpret text in the query value.
Since the original query is a single word, the operator parameter has no effect.
As a workaround you can perform your search in two steps:
Analyze your original query string with the analyze API:
curl -XGET "http://localhost:9200/testidx/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "search", "text": "kaffeetasse"}'
Use the tokens returned by the search analyzer as the words of a multi_match query, with the operator parameter set to "and" and the analyzer parameter set to "whitespace" (to prevent the already analyzed tokens from being analyzed again by the search analyzer):
curl -XGET "http://localhost:9200/testidx/_search" -H 'Content-Type: application/json' -d'{ "query": {"multi_match": {"query": "kaffee tasse", "fields": ["title", "description"], "operator": "and", "type": "cross_fields", "analyzer": "whitespace"}}}'
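The two steps can be glued together in client code. The following is a minimal sketch in Python, not a definitive implementation: the helper names subword_tokens and build_multi_match are hypothetical, and sample_response is a hand-written illustration of the shape of an _analyze response, not output captured from a real cluster. It also assumes that the analyzer may return the original compound word next to its parts at the same position; in that case the compound is dropped so that the AND query only requires the parts.

```python
import json
from collections import defaultdict


def subword_tokens(analyze_response, original_text):
    """Return the tokens to search for, in position order, without duplicates.

    Assumption: if several tokens share a position and one of them equals a
    word of the original input, that token is the undecomposed compound and
    is dropped in favour of its parts.
    """
    originals = {word.lower() for word in original_text.split()}
    by_position = defaultdict(list)
    for tok in analyze_response.get("tokens", []):
        by_position[tok["position"]].append(tok["token"])

    words = []
    for position in sorted(by_position):
        group = by_position[position]
        parts = [t for t in group if t not in originals]
        # Fall back to the whole group when the word was not decomposed.
        for token in (parts or group):
            if token not in words:
                words.append(token)
    return words


def build_multi_match(words, fields):
    """Build the request body of the second step from pre-analyzed words."""
    return {
        "query": {
            "multi_match": {
                "query": " ".join(words),
                "fields": list(fields),
                "operator": "and",
                "type": "cross_fields",
                # whitespace keeps the already analyzed tokens untouched
                "analyzer": "whitespace",
            }
        }
    }


# Hand-written example of an _analyze response (illustrative values only).
sample_response = {
    "tokens": [
        {"token": "kaffeetasse", "position": 0},
        {"token": "kaffee", "position": 0},
        {"token": "tasse", "position": 0},
    ]
}

words = subword_tokens(sample_response, "kaffeetasse")
body = build_multi_match(words, ["title", "description"])
print(json.dumps(body, indent=2))
```

The printed body can be sent to the _search endpoint of testidx in the same way as the curl command above. Keeping the token selection and the query construction as pure functions means the logic can be checked without a running cluster; the actual HTTP calls are left out on purpose.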
Upvotes: 2