Reputation: 2621
I have a problem with searching in Elasticsearch and hope that you can help.
I want to find a document whose field is indexed with the keyword tokenizer and only lowercased by the analyzer. Whenever an indexed term appears as part of the search query, I want Elasticsearch to find the document.
Example search:
"query": {
  "match": {
    "categoryNames": "CD&DVD Aufbewahrung schwarz"
  }
}
Document I want to find:
"_source": {
  "categoryId": 11972638,
  "categoryNames": [
    "DVD-Koffer",
    "CD-Koffer",
    "CD-Aufbewahrung",
    "DVD-Aufbwahrung",
    "DVD-Ordner",
    "EDV-DVD-Aufbewahrung",
    "EDV-CD-Aufbewahrung",
    "CD&DVD Aufbewahrung",
    "Multimediabox"
  ],
  "lvl3Id": 11972638
}
Index analyzer:
"analysis" : {
  "analyzer" : {
    "default" : {
      "type": "custom",
      "tokenizer": "keyword",
      "filter" : ["lowercase"]
    }
  }
}
Term vectors of the document I want to find:
"cd&dvd aufbewahrung": {
  "term_freq": 1,
  "tokens": [
    ...
  ]
},
"cd-aufbewahrung": {
  "term_freq": 1,
  "tokens": [
    ...
  ]
},
"cd-koffer": {
  "term_freq": 1,
  "tokens": [
    ...
  ]
},
....
With the search above I get no results. When I search only for "CD&DVD aufbewahrung", I find the document.
I think Elasticsearch is trying to find the term "CD&DVD Aufbewahrung schwarz", which does not exist, instead of matching "CD&DVD Aufbewahrung" and ignoring "schwarz".
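That suspicion can be checked with a small simulation. The sketch below is plain Python, not Elasticsearch: it assumes that a match query analyzes the query text with the same analyzer as the field (the default when no separate search analyzer is configured) and that the keyword tokenizer plus lowercase filter turns the whole input into one lowercased term. The function names are made up for illustration.

```python
def keyword_lowercase(text):
    """Mimic the keyword tokenizer + lowercase filter: one term for the whole input."""
    return [text.lower()]


# A few of the category names from the document, analyzed as at index time.
indexed_terms = set()
for name in ["CD-Koffer", "CD-Aufbewahrung", "CD&DVD Aufbewahrung", "Multimediabox"]:
    indexed_terms.update(keyword_lowercase(name))


def match(query):
    """True if any term produced from the query equals an indexed term."""
    return any(term in indexed_terms for term in keyword_lowercase(query))


print(match("CD&DVD aufbewahrung"))          # True
print(match("CD&DVD Aufbewahrung schwarz"))  # False
```

The second query becomes the single term "cd&dvd aufbewahrung schwarz", which equals none of the indexed terms, so nothing is found.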
The search cannot use the standard analyzer, because it is important that only "CD&DVD Aufbewahrung" matches "CD&DVD Aufbewahrung". A term that contains only "Aufbewahrung", or "Aufbewahrung CD&DVD", must not match, but it would if the indexed text were split into tokens, e.g. on whitespace.
A few example searches with my expectations for the document above:
CD&DVD Aufbewahrung -> Found
CD&DVD aufbewahrung -> Found
schwarz CD&DVD Aufbewahrung -> Found
CD&DVD Aufbewahrung gelb -> Found
schwarz CD&DVD Aufbewahrung gelb -> Found
CD&DVD schwarz Aufbewahrung -> Not Found
Aufbewahrung CD&DVD -> Not Found
schwarz CD & DVD Aufbewahrung -> Not Found
schwarzCD&DVD Aufbewahrung -> Not Found
Does anyone have an idea how to fix this?
Upvotes: 0
Views: 1322
Reputation: 4535
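A custom search analyzer with the shingle token filter may help here. The field is still indexed with a lowercased keyword analyzer, but the query is split on whitespace, lowercased, and expanded into shingles, so every run of two to four consecutive query words is tried as a candidate term. See the code below: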
PUT /so53412408
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        },
        "lowercase_shingle": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_shingle"
          ]
        }
      },
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 4
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "categoryNames": {
          "type": "text",
          "analyzer": "lowercase_keyword",
          "search_analyzer": "lowercase_shingle"
        }
      }
    }
  }
}
POST /so53412408/_doc
{
  "categoryNames": [
    "DVD-Koffer",
    "CD-Koffer",
    "CD-Aufbewahrung",
    "DVD-Aufbwahrung",
    "DVD-Ordner",
    "EDV-DVD-Aufbewahrung",
    "EDV-CD-Aufbewahrung",
    "CD&DVD Aufbewahrung",
    "Multimediabox"
  ]
}
GET /so53412408/_search
{
  "query": {
    "match": {
      "categoryNames": "schwarzCD&DVD Aufbewahrung"
    }
  }
}
With this mapping, the example searches behave as follows:
CD&DVD Aufbewahrung -> Found
CD&DVD aufbewahrung -> Found
schwarz CD&DVD Aufbewahrung -> Found
CD&DVD Aufbewahrung gelb -> Found
schwarz CD&DVD Aufbewahrung gelb -> Found
CD&DVD schwarz Aufbewahrung -> Not Found
Aufbewahrung CD&DVD -> Not Found
schwarz CD & DVD Aufbewahrung -> Not Found
schwarzCD&DVD Aufbewahrung -> Not Found
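To see why these results come out this way, here is a rough Python simulation of the two analyzers. It is only a sketch under assumptions: the whitespace tokenizer is approximated with `str.split()`, the shingle filter is assumed to run with its defaults besides the sizes (unigrams are kept and words are joined with a single space), and a query counts as "Found" as soon as one query term equals one indexed term. Scoring and everything else Elasticsearch does is ignored, and the Python function names simply reuse the analyzer names from the settings for readability.

```python
CATEGORY_NAMES = [
    "DVD-Koffer", "CD-Koffer", "CD-Aufbewahrung", "DVD-Aufbwahrung",
    "DVD-Ordner", "EDV-DVD-Aufbewahrung", "EDV-CD-Aufbewahrung",
    "CD&DVD Aufbewahrung", "Multimediabox",
]


def lowercase_keyword(text):
    """Index-time analyzer: the whole value becomes one lowercased term."""
    return [text.lower()]


def lowercase_shingle(text, min_size=2, max_size=4):
    """Search-time analyzer: whitespace tokens plus shingles of min_size..max_size words."""
    tokens = text.lower().split()
    terms = list(tokens)  # unigrams are kept alongside the shingles
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            terms.append(" ".join(tokens[start:start + size]))
    return terms


INDEXED_TERMS = {term for name in CATEGORY_NAMES for term in lowercase_keyword(name)}


def is_found(query):
    """True if any search-time term equals an index-time term."""
    return any(term in INDEXED_TERMS for term in lowercase_shingle(query))


for query in ["schwarz CD&DVD Aufbewahrung gelb", "Aufbewahrung CD&DVD"]:
    print(query, "->", "Found" if is_found(query) else "Not Found")
```

The first query produces the shingle "cd&dvd aufbewahrung", which equals the indexed term. The second only produces "aufbewahrung cd&dvd", so the word order has to match. Note that with `max_shingle_size` set to 4, a category name of more than four words could not be matched this way.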
Upvotes: 1