astropanic

Reputation: 10939

Elasticsearch wildcard query with hyphens and lowercase filter

I want to do a wildcard query for QNMZ-1900

As I read in the docs, and confirmed myself, the standard tokenizer of Elasticsearch splits words on hyphens; for example, QNMZ-1900 is split into QNMZ and 1900.
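
You can see the split with the _analyze API (a quick sketch, assuming a 1.x-era cluster matching the curl syntax used below; note that the standard analyzer also lowercases the tokens):

curl 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d 'QNMZ-1900'
# returns two tokens: "qnmz" and "1900"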

To prevent this behavior, I'm using the not_analyzed feature.

curl -XPUT 'localhost:9200/test-idx' -d '{
"mappings": {
    "doc": {
        "properties": {
            "foo" : {
                "type": "string",
                "index": "not_analyzed"
            }
        }
    }
}
}'

I'm putting something into my index:

curl -XPUT 'localhost:9200/test-idx/doc/1' -d '{"foo": "QNMZ-1900"}'

Refreshing it:

curl -XPOST 'localhost:9200/test-idx/_refresh'

Now I can use a wildcard query and find QNMZ-1900:

curl 'localhost:9200/test-idx/doc/_search?pretty=true' -d '{
"query": {
     "wildcard" : { "foo" : "QNMZ-19*" }
}
}'

My question:

How can I run a wildcard query with a lowercase search term?

I've tried:

curl -XDELETE 'localhost:9200/test-idx'
curl -XPUT 'localhost:9200/test-idx' -d '{
"mappings": {
    "doc": {
        "properties": {
            "foo" : {
                "type": "string",
                "index": "not_analyzed",
                "filter": "lowercase"
            }
        }
    }
}
}'
curl -XPUT 'localhost:9200/test-idx/doc/1' -d '{"foo": "QNMZ-1900"}'
curl -XPOST 'localhost:9200/test-idx/_refresh'

but my lowercase query:

curl 'localhost:9200/test-idx/doc/_search?pretty=true' -d '{
"query": {
     "wildcard" : { "foo" : "qnmz-19*" }
}
}'

doesn't find anything.

How can I fix it?

Upvotes: 5

Views: 7497

Answers (2)

Radosław Osiński

Reputation: 408

I have checked this approach in my pet project, based on ES 6.1. A data model like the one below allows searching as expected in the question:

PUT test-idx
{
    "settings": {
        "analysis": {
            "analyzer": {
                "keylower": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"]
                }
            }
        }
    }
}

POST /test-idx/doc/_mapping
{
    "properties": {
        "foo": {
            "type": "text",
            "fields": {
                "raw": {
                    "type": "keyword"
                },
                "lowercase_foo": {
                    "type": "text",
                    "analyzer": "keylower"
                }
            }
        }
    }
}

PUT /test-idx/doc/1
{"foo": "QNMZ-1900"}

Check the results of these two searches. The first will return one hit; the second will return 0 hits.

GET /test-idx/doc/_search
{
  "query": {
     "wildcard" : { "foo.lowercase_foo" : "qnmz-19*" }
  }
}

GET /test-idx/doc/_search
{
  "query": {
     "wildcard" : { "foo" : "qnmz-19*" }
  }
}
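
As a quick sanity check (just a sketch, assuming the same 6.1 cluster), the _analyze API shows what the keylower analyzer produces, a single lowercased token qnmz-1900:

GET /test-idx/_analyze
{
  "analyzer": "keylower",
  "text": "QNMZ-1900"
}

Note that the foo.raw keyword subfield defined above keeps the original casing, so a wildcard query on foo.raw should match QNMZ-19* but not qnmz-19*.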

Thanks @ThomasC for the feedback. Please be careful with my answer: I am just learning Elasticsearch and am not an expert in it, so I don't know whether this is production-ready advice!

Upvotes: 0

ThomasC

Reputation: 8165

One solution is to define a custom analyzer using

  • a keyword tokenizer (which keeps the input value as it is, as if it were not_analyzed)
  • a lowercase token filter

I've tried this:

POST test-idx
{
  "index":{
    "analysis":{
      "analyzer":{
        "lowercase_hyphen":{
          "type":"custom",
          "tokenizer":"keyword",
          "filter":["lowercase"]
        }
      }
    }
  }
}

PUT test-idx/doc/_mapping
{
  "doc":{
    "properties": {
        "foo" : {
          "type": "string",
          "analyzer": "lowercase_hyphen"
        }
    }      
  }
}

POST test-idx/doc
{
  "foo":"QNMZ-1900"
}

As you can see, calling the _analyze endpoint like this:

GET test-idx/_analyze?analyzer=lowercase_hyphen&text=QNMZ-1900

outputs only one token, lowercased but not split on the hyphen:

{
   "tokens": [
      {
         "token": "qnmz-1900",
         "start_offset": 0,
         "end_offset": 9,
         "type": "word",
         "position": 1
      }
   ]
}

Then, using the same query:

POST test-idx/doc/_search
{
  "query": {
    "wildcard" : { "foo" : "qnmz-19*" }    
  }
}

I have this result, which is what you want:

{
   "took": 66,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "test-idx",
            "_type": "doc",
            "_id": "wo1yanIjQGmvgfScMg4hyg",
            "_score": 1,
            "_source": {
               "foo": "QNMZ-1900"
            }
         }
      ]
   }
}

However, please note that this will only allow you to query using lowercased values. As stated by Andrei in a comment, the same query with the value QNMZ-19* won't return anything.

The reason can be found in the documentation: wildcard query values aren't analyzed at search time.
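
If the incoming search term may be mixed-case, one pragmatic workaround (not from this answer, just a hedged sketch) is to lowercase it on the client side before building the wildcard query:

# lowercase the user-supplied term before it reaches the (unanalyzed) wildcard query
term=$(echo 'QNMZ-19' | tr '[:upper:]' '[:lower:]')   # -> qnmz-19
curl 'localhost:9200/test-idx/doc/_search?pretty=true' -d '{
  "query": {
    "wildcard" : { "foo" : "'"$term"'*" }
  }
}'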

Upvotes: 8
