Noor

Reputation: 71

Elasticsearch multi_match query with AND operator for the tokens generated by the hyphenation_decompounder token filter

I used the hyphenation_decompounder token filter for German and followed the example in the documentation. So far so good: it works. The text kaffeetasse is tokenized into kaffee and tasse.

The problem arises when I use a multi_match query for kaffeetasse to find documents where both kaffee AND tasse match. It seems that multi_match uses OR for the tokens generated by the hyphenation_decompounder filter instead of the operator ("and") given in the multi_match query. Here is my test case:

Mapping

curl -XPUT "http://localhost:9200/testidx" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "index": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase"]
          },
          "search": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase", "hyph"]
          }
        },
        "filter": {
          "hyph": {
            "type": "hyphenation_decompounder",
            "hyphenation_patterns_path": "analysis/de_DR.xml",
            "word_list": ["kaffee", "zucker", "tasse"],
            "only_longest_match": true,
            "min_subword_size": 4
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "index",
        "search_analyzer": "search"
      },
      "description": {
        "type": "text",
        "analyzer": "index",
        "search_analyzer": "search"
      }
    }
  }
}'

Document id=1

curl -XPOST "http://localhost:9200/testidx/_doc/1" -H 'Content-Type: application/json' -d'{  "title" : "Kaffee",  "description": "Milch Kaffee tasse"}' 

Document id=2

curl -XPOST "http://localhost:9200/testidx/_doc/2" -H 'Content-Type: application/json' -d'{  "title" : "Kaffee",  "description": "Latte Kaffee Becher"}' 

Multi-Match query

curl -XGET "http://localhost:9200/testidx/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "multi_match": {
      "query": "kaffeetasse",
      "fields": ["title", "description"],
      "operator": "and",
      "type": "cross_fields",
      "analyzer": "search"
    }
  }
}'

I expect Elasticsearch to return only the document with id=1, since it contains both kaffee AND tasse in its fields, but it returns both documents because each of them contains kaffee OR tasse.

Elasticsearch: 7.9.2

de_DR.xml downloaded from https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download as mentioned in the documentation.

Upvotes: 2

Views: 898

Answers (1)

Alexey Prudnikov

Reputation: 1123

Elasticsearch returns both documents because it applies the operator parameter to the original query kaffeetasse, not to the tokens kaffee and tasse produced by the analyzer. This behavior is described in the documentation for the match query:

operator (Optional, string) Boolean logic used to interpret text in the query value.

Since the original query is a single word, the operator parameter has no effect.
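
One way to picture this is the simplified model below. It is only a sketch, not Elasticsearch's actual implementation: it assumes the decompounder emits the subwords at the same position as the undivided word (worth confirming with the analyze API on your index), that terms sharing a position are treated as alternatives, and that the operator only joins clauses from different positions. The function name matches and the flattened per-document term sets are made up for illustration, and field handling (cross_fields) is ignored.

```python
from collections import defaultdict


def matches(tokens, doc_terms, operator="and"):
    """Toy model: tokens is a list of (term, position) pairs.

    Terms at the same position form one clause (any of them may match);
    the operator combines the clauses of different positions.
    """
    by_position = defaultdict(set)
    for term, position in tokens:
        by_position[position].add(term)
    clause_hits = [bool(terms & doc_terms) for terms in by_position.values()]
    return all(clause_hits) if operator == "and" else any(clause_hits)


# Lowercased terms of the two documents from the question (title + description).
doc1 = {"kaffee", "milch", "tasse"}
doc2 = {"kaffee", "latte", "becher"}

# Assumed analyzer output for "kaffeetasse": every token shares position 0.
decompounded = [("kaffeetasse", 0), ("kaffee", 0), ("tasse", 0)]
# The same subwords sent as two separate words: two positions.
pre_split = [("kaffee", 0), ("tasse", 1)]

print(matches(decompounded, doc1), matches(decompounded, doc2))
print(matches(pre_split, doc1), matches(pre_split, doc2))
```

In this model the single-word query produces one clause, so "and" has nothing to combine and document 2 matches on kaffee alone; once the subwords occupy separate positions, "and" requires both.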

As a workaround you can perform your search in two steps:

  1. Analyze your original query string with the analyze API:

     curl -XGET "http://localhost:9200/testidx/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "search", "text": "kaffeetasse"}'
    
  2. Use the tokens returned by the search analyzer as the words of the multi_match query, with the operator parameter set to and and the analyzer parameter set to whitespace (to prevent the already analyzed tokens from being analyzed again by the search analyzer):

     curl -XGET "http://localhost:9200/testidx/_search" -H 'Content-Type: application/json' -d'{ "query": {"multi_match": {"query": "kaffee tasse", "fields": ["title", "description"], "operator": "and", "type": "cross_fields", "analyzer": "whitespace"}}}'
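
The two steps above could be chained in client code roughly as follows. This is a minimal sketch using only the Python standard library; the helper names (subword_terms, build_query, post_json, search_decompounded) are hypothetical, and the URL and index name are taken from the question. It also assumes that the analyze response may contain the undivided word next to its subwords at the same position, in which case the undivided word is dropped so that "and" is applied to the parts only.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200/testidx"  # index from the question


def subword_terms(analyze_response, text):
    """Collect the tokens of an _analyze response, in position order.

    Assumption: if a position holds an input word together with other
    tokens, the other tokens are its subwords and the input word is dropped.
    """
    input_words = set(text.lower().split())
    by_position = {}
    for tok in analyze_response.get("tokens", []):
        terms = by_position.setdefault(tok["position"], [])
        if tok["token"] not in terms:
            terms.append(tok["token"])
    result = []
    for position in sorted(by_position):
        terms = by_position[position]
        parts = [t for t in terms if t not in input_words]
        result.extend(parts or terms)  # keep the word itself if it was not split
    return result


def build_query(terms, fields):
    """Step 2: multi_match over the pre-analyzed terms, split on whitespace only."""
    return {
        "query": {
            "multi_match": {
                "query": " ".join(terms),
                "fields": fields,
                "operator": "and",
                "type": "cross_fields",
                "analyzer": "whitespace",
            }
        }
    }


def post_json(path, body):
    """Send a JSON body to the index and decode the JSON response."""
    request = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def search_decompounded(text, fields):
    """Step 1 (analyze) followed by step 2 (search)."""
    analyzed = post_json("/_analyze", {"analyzer": "search", "text": text})
    return post_json("/_search", build_query(subword_terms(analyzed, text), fields))


# Example call (needs a running cluster):
# search_decompounded("kaffeetasse", ["title", "description"])
```

The cost of this approach is one extra round trip per search, since the query text is analyzed on the server before the real query is sent.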
    

Upvotes: 2
