Maarkoize

Reputation: 2621

Elasticsearch - Find document by term which is only part of given query-string

I have a problem with searching in Elasticsearch and hope that you can help.

I want to find a document whose field is keyword-tokenized and only lowercased by the index analyzer. When the generated term is contained in the search query string, I want Elasticsearch to find the document.

Example search:

"query": {
    "match": {
        "categoryNames": "CD&DVD Aufbewahrung schwarz"
    }
}

Document I want to find:

"_source": {
    "categoryId": 11972638,
    "categoryNames": [
        "DVD-Koffer",
        "CD-Koffer",
        "CD-Aufbewahrung",
        "DVD-Aufbwahrung",
        "DVD-Ordner",
        "EDV-DVD-Aufbewahrung",
        "EDV-CD-Aufbewahrung",
        "CD&DVD Aufbewahrung",
        "Multimediabox"
    ],
    "lvl3Id": 11972638
}

Index Analyzer:

"analysis": {
    "analyzer": {
        "default": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["lowercase"]
        }
    }
}

Term vectors of the document I want to find:

"cd&dvd aufbewahrung": {
    "term_freq": 1,
    "tokens": [
      ...
    ]
},
"cd-aufbewahrung": {
     "term_freq": 1,
     "tokens": [
       ...
      ]
},
"cd-koffer": {
      "term_freq": 1,
      "tokens": [
        ...
       ]
},
....

I get no results. When I search only for "CD&DVD aufbewahrung", I find the document.

I think that Elasticsearch is trying to find the term "cd&dvd aufbewahrung schwarz", which does not exist, instead of matching "CD&DVD Aufbewahrung" and ignoring "schwarz".
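To illustrate that suspicion, here is a minimal Python sketch. It is not Elasticsearch code: the helper names `keyword_lowercase` and `matches` are made up for this illustration, and it only mimics what a keyword tokenizer followed by a lowercase filter would do when the same analyzer is applied to both the indexed values and the query string.

```python
# Rough simulation of a "keyword" tokenizer + "lowercase" filter.
# The helper names are hypothetical; this is not an Elasticsearch API.

def keyword_lowercase(text):
    """Emit the whole input as one single lowercased term."""
    return [text.lower()]

category_names = [
    "DVD-Koffer", "CD-Koffer", "CD-Aufbewahrung", "DVD-Aufbwahrung",
    "DVD-Ordner", "EDV-DVD-Aufbewahrung", "EDV-CD-Aufbewahrung",
    "CD&DVD Aufbewahrung", "Multimediabox",
]

# Each array element becomes exactly one indexed term.
indexed_terms = {term for name in category_names
                 for term in keyword_lowercase(name)}

def matches(query):
    """A match needs at least one query term to equal an indexed term."""
    return any(term in indexed_terms for term in keyword_lowercase(query))

print(matches("CD&DVD aufbewahrung"))          # query term equals an indexed term
print(matches("CD&DVD Aufbewahrung schwarz"))  # whole string is one term, no equal indexed term
```

Under these assumptions the second query fails because the complete query string, including "schwarz", is compared as one term.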

The search cannot use the standard analyzer, because it is important that only "CD&DVD Aufbewahrung" matches "CD&DVD Aufbewahrung", and not, for example, a term that contains only "Aufbewahrung" or "Aufbewahrung CD&DVD". Such terms would be found if the text were tokenized on whitespace, for example.

A few example searches with my expectations for the document above:

CD&DVD Aufbewahrung -> Found
CD&DVD aufbewahrung -> Found
schwarz CD&DVD Aufbewahrung -> Found
CD&DVD Aufbewahrung gelb -> Found
schwarz CD&DVD Aufbewahrung gelb -> Found
CD&DVD schwarz Aufbewahrung -> Not Found
Aufbewahrung CD&DVD -> Not Found
schwarz CD & DVD Aufbewahrung -> Not Found
schwarzCD&DVD Aufbewahrung -> Not Found

Does anyone have an idea how to fix this?

Upvotes: 0

Views: 1322

Answers (1)

Piotr Pradzynski

Reputation: 4535

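A custom analyzer with the Shingle Token Filter may be helpful here. The idea is to keep indexing with a lowercased keyword analyzer, but to analyze the search query with a whitespace tokenizer followed by the lowercase and shingle filters, so that runs of adjacent query words are tried as candidate terms. Please see the code below: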

Mapping

PUT /so53412408
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        },
        "lowercase_shingle": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_shingle"
          ]
        }
      },
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 4
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "categoryNames": {
          "type": "text",
          "analyzer": "lowercase_keyword",
          "search_analyzer": "lowercase_shingle"
        }
      }
    }
  }
}

Sample data

POST /so53412408/_doc
{
  "categoryNames": [
    "DVD-Koffer",
    "CD-Koffer",
    "CD-Aufbewahrung",
    "DVD-Aufbwahrung",
    "DVD-Ordner",
    "EDV-DVD-Aufbewahrung",
    "EDV-CD-Aufbewahrung",
    "CD&DVD Aufbewahrung",
    "Multimediabox"
  ]
}

Search query

GET /so53412408/_search
{
  "query": {
    "match": {
      "categoryNames": "schwarzCD&DVD Aufbewahrung"
    }
  }
}

Results

CD&DVD Aufbewahrung              -> Found
CD&DVD aufbewahrung              -> Found
schwarz CD&DVD Aufbewahrung      -> Found
CD&DVD Aufbewahrung gelb         -> Found
schwarz CD&DVD Aufbewahrung gelb -> Found
CD&DVD schwarz Aufbewahrung      -> Not Found
Aufbewahrung CD&DVD              -> Not Found
schwarz CD & DVD Aufbewahrung    -> Not Found
schwarzCD&DVD Aufbewahrung       -> Not Found
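To see why the results come out this way, the two analyzers can be approximated in plain Python. This is only a sketch under stated assumptions, not Elasticsearch code: the function names `shingles`, `search_terms`, and `found` are invented for the illustration, and it assumes the shingle filter keeps its default behaviour of also emitting single tokens (unigrams) and joining shingle parts with a single space.

```python
# Rough simulation of the two analyzers from the mapping above.
# All function names are hypothetical helpers, not Elasticsearch APIs.

def shingles(tokens, min_size=2, max_size=4):
    """Return the unigrams plus every run of min_size..max_size adjacent tokens."""
    out = list(tokens)  # assumption: unigrams are emitted as well
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out

def search_terms(query):
    """Approximate lowercase_shingle: whitespace split, lowercase, shingle."""
    return shingles(query.lower().split())

category_names = [
    "DVD-Koffer", "CD-Koffer", "CD-Aufbewahrung", "DVD-Aufbwahrung",
    "DVD-Ordner", "EDV-DVD-Aufbewahrung", "EDV-CD-Aufbewahrung",
    "CD&DVD Aufbewahrung", "Multimediabox",
]

# Approximate lowercase_keyword: one lowercased term per array element.
indexed_terms = {name.lower() for name in category_names}

def found(query):
    """The document matches if any generated search term is an indexed term."""
    return any(term in indexed_terms for term in search_terms(query))

for query in ["schwarz CD&DVD Aufbewahrung gelb",
              "CD&DVD schwarz Aufbewahrung",
              "schwarz CD & DVD Aufbewahrung"]:
    print(query, "->", "Found" if found(query) else "Not Found")
```

Two consequences follow from this sketch, given the same assumptions. Because single tokens are also searched, a query that contains a one-word category such as "Multimediabox" next to other words would match as well. And since shingles stop at four words, a category name with more than four whitespace-separated words would not be matched by a longer query. The `_analyze` API can be used to check the tokens that the real analyzers produce.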

Upvotes: 1
