Reputation: 1770
Say I've indexed a document in Elastic with these two keyword fields:
"lastModified": "02/03/2020"
"theText": "Hello there"
I want to support case-insensitive substring searches against both fields. The doc should match when I search "lastModified" with any of these query strings:
"02"
"2/03"
"/0"
"2020"
And the doc should match when I search "theText" for any of these (note case changes):
"helLO"
"lo there"
"the"
You get the idea. I just need a simple case-insensitive substring search. No fuzziness or anything fancy. I've tried wildcards, regexes, escaping the slashes for "lastModified", and remapping / to _slash_, and I am stuck. Wildcards work except when there's a slash. How can I get the wildcard approach to work with slashes? Or is there a better way?
I'd prefer to avoid going the N-Gram route, since the text data could be a very long paragraph and it would create many grams :).
To summarize, my preferred solution would match case-insensitive substrings on both fields, handle slashes, and avoid n-grams.
For now I'm using an ugly Regex against a Keyword field. It works but feels pretty silly.
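For reference, the ugly workaround looks roughly like this (just a sketch; the pattern is illustrative, and the case_insensitive flag needs Elasticsearch 7.10+, otherwise I lowercase both sides myself):
{
  "query": {
    "regexp": {
      "theText.keyword": {
        "value": ".*lo there.*",
        "case_insensitive": true
      }
    }
  }
}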
Upvotes: 0
Views: 879
Reputation: 16192
You need to use an n-gram tokenizer for a substring match. Since you also want to keep /, you need to add punctuation to the token_chars as well.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit",
            "punctuation"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "lastModified": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "theText": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
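If you want to see exactly which grams the custom analyzer emits, you can test it with the _analyze API (a quick sketch; 67825121 is the index name used in the results below, replace it with your own):
POST /67825121/_analyze
{
  "analyzer": "my_analyzer",
  "text": "02/03/2020"
}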
Index Data:
{
  "theText": "Hello there"
}
{
  "lastModified": "02/03/2020"
}
Search Query on field 1:
{
  "query": {
    "match": {
      "theText": "helLO"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67825121",
"_type": "_doc",
"_id": "2",
"_score": 0.970927,
"_source": {
"theText": "Hello there"
}
}
]
Search Query on field 2:
{
  "query": {
    "match": {
      "lastModified": "2/03"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67825121",
"_type": "_doc",
"_id": "1",
"_score": 2.0497348,
"_source": {
"lastModified": "02/03/2020"
}
}
]
I have tried other queries as well, and they return the correct results for your use case.
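For example (my own addition, not part of the test run above), even the punctuation-only fragment "/0" from your list should match, because the tokenizer emits grams that span the slash:
{
  "query": {
    "match": {
      "lastModified": "/0"
    }
  }
}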
Update 1:
Elasticsearch uses the standard analyzer if no analyzer is specified. Assuming that the lastModified and theText fields are of text type, "02/03/2020" will get tokenized into:
{
  "tokens": [
    {
      "token": "02",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "03",
      "start_offset": 3,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "2020",
      "start_offset": 6,
      "end_offset": 10,
      "type": "<NUM>",
      "position": 2
    }
  ]
}
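You can reproduce this token list yourself with the _analyze API, this time against the standard analyzer and without any index:
POST /_analyze
{
  "analyzer": "standard",
  "text": "02/03/2020"
}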
Now, when you run a wildcard query on either of the above fields, it searches against the tokens shown above. Since no token matches "2/03", you get empty results for that query.
It is better to use a keyword field for wildcard queries. If you have not explicitly defined an index mapping, you can add .keyword to both fields. This uses the keyword analyzer instead of the standard analyzer, so the whole value is stored as a single token (notice the ".keyword" after the field names in the queries below).
Search Query:
{
  "query": {
    "wildcard": {
      "lastModified.keyword": {
        "value": "*2/03*"
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "67825121",
"_type": "_doc",
"_id": "2",
"_score": 1.0,
"_source": {
"lastModified": "02/03/2020"
}
}
]
Search Query:
{
  "query": {
    "wildcard": {
      "theText.keyword": {
        "value": "*lo there*"
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "67825121",
"_type": "_doc",
"_id": "1",
"_score": 1.0,
"_source": {
"theText": "Hello there"
}
}
]
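One caveat (my addition, beyond the examples above): wildcard queries against a keyword field are case-sensitive by default, so "*helLO*" would not match "Hello there". On Elasticsearch 7.10+ you can set the case_insensitive flag, roughly like this:
{
  "query": {
    "wildcard": {
      "theText.keyword": {
        "value": "*helLO*",
        "case_insensitive": true
      }
    }
  }
}
On older versions you would need to lowercase the query string yourself and keep a lowercased copy of the field, or stick with the n-gram approach above, which handles case via the lowercase filter.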
Upvotes: 1