Reputation: 10939
I want to do a wildcard query for QNMZ-1900
As I read in the docs and tried myself, the standard tokenizer of Elasticsearch splits words on hyphens: for example, QNMZ-1900 will be split into QNMZ and 1900.
To prevent this behavior, I'm using the not_analyzed setting.
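You can check the split yourself with the _analyze API (a sketch against a local node on port 9200, using the same old-style query-string form of _analyze as the rest of this question):

```shell
curl 'localhost:9200/_analyze?tokenizer=standard&text=QNMZ-1900&pretty=true'
```

This should return two separate tokens, QNMZ and 1900, confirming the split on the hyphen.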
curl -XPUT 'localhost:9200/test-idx' -d '{
"mappings": {
"doc": {
"properties": {
"foo" : {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
I'm putting something into my index:
curl -XPUT 'localhost:9200/test-idx/doc/1' -d '{"foo": "QNMZ-1900"}'
Refreshing it:
curl -XPOST 'localhost:9200/test-idx/_refresh'
Now I can use a wildcard query and find QNMZ-1900:
curl 'localhost:9200/test-idx/doc/_search?pretty=true' -d '{
"query": {
"wildcard" : { "foo" : "QNMZ-19*" }
}
}'
My question:
How can I run a wildcard query with a lowercase search term?
I've tried:
curl -XDELETE 'localhost:9200/test-idx'
curl -XPUT 'localhost:9200/test-idx' -d '{
"mappings": {
"doc": {
"properties": {
"foo" : {
"type": "string",
"index": "not_analyzed",
"filter": "lowercase"
}
}
}
}
}'
curl -XPUT 'localhost:9200/test-idx/doc/1' -d '{"foo": "QNMZ-1900"}'
curl -XPOST 'localhost:9200/test-idx/_refresh'
but my lowercase query:
curl 'localhost:9200/test-idx/doc/_search?pretty=true' -d '{
"query": {
"wildcard" : { "foo" : "qnmz-19*" }
}
}'
doesn't find anything.
How can I fix it?
Upvotes: 5
Views: 7497
Reputation: 408
I have checked this approach in my pet project, which is based on ES 6.1. A data model like the one below allows searching as expected in the question:
PUT test-idx
{
"settings": {
"analysis": {
"analyzer": {
"keylower": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase"]
}
}
}
}
}
POST /test-idx/doc/_mapping
{
"properties": {
"foo": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
},
"lowercase_foo": {
"type": "text",
"analyzer": "keylower"
}
}
}
}
}
PUT /test-idx/doc/1
{"foo": "QNMZ-1900"}
Check the results of these two searches: the first will return one hit, the second will return 0 hits. The second finds nothing because the plain foo field is analyzed with the standard analyzer, so the index contains the tokens qnmz and 1900, and the pattern qnmz-19* matches neither of them.
GET /test-idx/doc/_search
{
"query": {
"wildcard" : { "foo.lowercase_foo" : "qnmz-19*" }
}
}
GET /test-idx/doc/_search
{
"query": {
"wildcard" : { "foo" : "qnmz-19*" }
}
}
Thanks @ThomasC for the opinion. Please be careful with my answer: I am just learning Elasticsearch and I am not an expert in this database, so I don't know whether this is production-ready advice!
Upvotes: 0
Reputation: 8165
One solution is to define a custom analyzer using:
- the keyword tokenizer (which keeps the input value as it is, as if it was not_analyzed)
- the lowercase token filter

I've tried this:
POST test-idx
{
"index":{
"analysis":{
"analyzer":{
"lowercase_hyphen":{
"type":"custom",
"tokenizer":"keyword",
"filter":["lowercase"]
}
}
}
}
}
PUT test-idx/doc/_mapping
{
"doc":{
"properties": {
"foo" : {
"type": "string",
"analyzer": "lowercase_hyphen"
}
}
}
}
POST test-idx/doc
{
"foo":"QNMZ-1900"
}
As you can see by calling the _analyze endpoint like this:
GET test-idx/_analyze?analyzer=lowercase_hyphen&text=QNMZ-1900
it outputs only one token, lowercased but not split on the hyphen:
{
"tokens": [
{
"token": "qnmz-1900",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 1
}
]
}
Then, using the same query:
POST test-idx/doc/_search
{
"query": {
"wildcard" : { "foo" : "qnmz-19*" }
}
}
I have this result, which is what you want:
{
"took": 66,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test-idx",
"_type": "doc",
"_id": "wo1yanIjQGmvgfScMg4hyg",
"_score": 1,
"_source": {
"foo": "QNMZ-1900"
}
}
]
}
}
However, please note that this will allow you to query only with lowercased values.
As stated by Andrei in a comment, the same query with the value QNMZ-19* won't return anything.
The reason can be found in the documentation: wildcard query terms are not analyzed at search time.
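Since the wildcard term isn't analyzed, a simple workaround is to lowercase the term on the client side before building the query. A minimal sketch in shell (tr is standard POSIX; the variable names are mine, not part of the original answer):

```shell
# Lowercase the raw search term before it goes into the wildcard query;
# Elasticsearch will not lowercase it for us at search time.
raw='QNMZ-19*'
term=$(printf '%s' "$raw" | tr '[:upper:]' '[:lower:]')
echo "$term"
```

The resulting qnmz-19* can then be interpolated into the wildcard query body shown above.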
Upvotes: 8