Rio

Reputation: 365

Case-insensitive search in Elasticsearch regardless of uppercase or lowercase input

I am working with Elasticsearch and have run into a problem. I would be really thankful for any hints.

I want to analyze the fields "name" and "description", which contain many different entries. For example, someone wants to search for Sara: whether they enter SARA, SAra or sara, they should get Sara back. Elasticsearch uses an analyzer that makes everything lowercase.

I want the search to be case insensitive: regardless of whether the user types the name in uppercase or lowercase, they should get results. I am using an ngram filter to search names, plus a lowercase filter to make it case insensitive. But I want to make sure a person gets results whether they enter the name in uppercase or lowercase.

Is there any way to do this in Elasticsearch?

{"settings": {

        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 80
                }
            },
            "analyzer": {
                "index_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "ngram_filter", "lowercase" ]
                },

I have attached an example.js file, which includes a JSON example, and a search.txt file to explain my problem. I hope this makes the problem clearer. Both files are in this OneDrive folder: https://1drv.ms/f/s!AsW4Pb3Y55Qjb34OtQI7qQotLzc

Upvotes: 5

Views: 22889

Answers (3)

Binita Bharati

Reputation: 5888

This answer is in the context of Elasticsearch 7.14. Let me restate the question another way:

Irrespective of the letter case provided in the match query, you would like to get back the documents that were analysed with:

   "tokenizer": "keyword",
   "filter": [ "ngram_filter", "lowercase" ]

Now, coming to the answer part:

It will not be possible for the match query to return docs that were analysed with the lowercase filter when the match query contains uppercase letters. The analysis you applied in the settings is used both when indexing and when searching data. Although it is also possible to apply different analysers for indexing and searching, I do not see that helping in your case. You would have to convert the match query value to lowercase before making the query. So, if your filter is lowercase, you cannot match by, say, Sara or SARA or sAra. The match parameter should be all lowercase, just as in your analyser.
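As a minimal sketch of the client-side lowercasing suggested above, the following Python builds the match query body with the search text normalised first. The helper name `build_match_query` is hypothetical and not part of any Elasticsearch client; it only produces the JSON body, and how that body is sent to the cluster is left out.

```python
import json


def build_match_query(field, user_input):
    """Return a match query body with the search text lowercased.

    Hypothetical helper for illustration: it does not talk to Elasticsearch,
    it only prepares the request body.
    """
    return {"query": {"match": {field: user_input.lower()}}}


if __name__ == "__main__":
    # Every spelling of the name produces the same request body.
    for text in ("SARA", "SAra", "sara"):
        print(json.dumps(build_match_query("name", text)))
```

With this in place, whatever case the user types, the value that reaches the match query is always lowercase.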

Upvotes: 0

Adam Łepkowski

Reputation: 2078

The analysis process runs twice for analysed full-text fields: first when the data is stored and again when you search. Note that the input JSON comes back in the same shape in the output of a search query; analysis is only used to create tokens for the inverted index. The key steps for your solution are:

  1. Create two analysers: one with the ngram filter and a second without it. You don't need to run the search query through ngram, because you already have the exact value you want to search for.
  2. Define the mappings correctly for your fields. Two mapping parameters let you specify analysers: one is used at index time (analyzer) and the other at search time (search_analyzer). If you specify only analyzer, that analyser is used at both index and search time.

You can read more about it here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html

Your code should look like this:

PUT /my_index
{
   "settings": {
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "ngram",
               "min_gram": 1,
               "max_gram": 5
            }
         },
         "analyzer": {
            "index_store_ngram": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "ngram_filter",
                  "lowercase"
               ]
            }
         }
      }
   },
   "mappings": {
      "my_type": {
         "properties": {
            "name": {
               "type": "string",
               "analyzer": "index_store_ngram",
               "search_analyzer": "standard"
            }
         }
      }
   }
}

POST /my_index/my_type/1
{
     "name": "Sara_11_01"
}

GET /my_index/my_type/_search
{
    "query": {
        "match": {
           "name": "sara"
        }
    }
}

GET /my_index/my_type/_search
{
    "query": {
        "match": {
           "name": "SARA"
        }
    }
}

GET /my_index/my_type/_search
{
    "query": {
        "match": {
           "name": "SaRa"
        }
    }
}
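To see why all three queries find the document, here is a rough Python simulation of the two analysis chains. It is a sketch under assumptions, not Elasticsearch itself: the regex `\w+` only approximates the standard tokenizer, and the function names are made up for illustration.

```python
import re


def standard_tokens(text):
    # Rough stand-in for the standard tokenizer (assumption): runs of
    # letters, digits and underscores become one token each.
    return re.findall(r"\w+", text)


def ngrams(token, min_gram=1, max_gram=5):
    # All substrings of the token with length min_gram..max_gram.
    return [
        token[start:start + size]
        for start in range(len(token))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(token)
    ]


def index_terms(text):
    # Index time, like index_store_ngram: tokenise, ngram, then lowercase.
    return {gram.lower() for tok in standard_tokens(text) for gram in ngrams(tok)}


def search_terms(text):
    # Search time, like the standard analyzer: tokenise, then lowercase.
    return {tok.lower() for tok in standard_tokens(text)}


def matches(doc_text, query_text):
    # A match query hits when at least one search term is an indexed term.
    return bool(index_terms(doc_text) & search_terms(query_text))


if __name__ == "__main__":
    for query in ("sara", "SARA", "SaRa"):
        print(query, matches("Sara_11_01", query))
```

Both sides are lowercased before comparison, so the case of the query does not matter. Because the search text is not split into ngrams, a search term longer than max_gram (5 here) has no indexed counterpart and would not match in this setup.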

Edit 1: updated the code for the new example provided in the question.

Upvotes: 1

jay

Reputation: 2077

Is there a specific reason you are using ngram? Elasticsearch uses the same analyzer on the query as on the text you index, unless search_analyzer is explicitly specified, as mentioned by @Adam in his answer. In your case it might be enough to use a standard tokenizer with a lowercase filter.

I created an index with the following settings and mapping:

{
   "settings": {
      "analysis": {
         "analyzer": {
            "custom_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase"
               ]
            }
         }
      }
   },
   "mappings": {
      "typehere": {
         "properties": {
            "name": {
               "type": "string",
               "analyzer": "custom_analyzer"
            },
            "description": {
               "type": "string",
               "analyzer": "custom_analyzer"
            }
         }
      }
   }
}

I indexed two documents.

Doc 1

PUT /test_index/test_mapping/1
    {
        "name" : "Sara Connor",
        "Description" : "My real name is Sarah Connor."
    }

Doc 2

PUT /test_index/test_mapping/2
    {
        "name" : "John Connor",
        "Description" : "I might save humanity someday."
    }

Then I ran a simple search:

POST /test_index/_search
{
    "query" : {
        "match" : {
            "name" : "SARA"
        }
    }
}

This returns only the first document. I also tried "sara" and "Sara", with the same results.

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_mapping",
        "_id": "1",
        "_score": 0.19178301,
        "_source": {
          "name": "Sara Connor",
          "Description": "My real name is Sarah Connor."
        }
      }
    ]
  }
}
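The result above can be reproduced with a toy inverted index in Python. This is only a sketch of the idea: `analyze` approximates a standard tokenizer plus lowercase filter with a regex (an assumption), and scoring is ignored entirely.

```python
import re
from collections import defaultdict


def analyze(text):
    # Approximation of standard tokenizer + lowercase filter (assumption).
    return [token.lower() for token in re.findall(r"\w+", text)]


# The "name" values of the two documents indexed above.
docs = {"1": "Sara Connor", "2": "John Connor"}

# Inverted index: analysed term -> ids of the documents containing it.
index = defaultdict(set)
for doc_id, name in docs.items():
    for term in analyze(name):
        index[term].add(doc_id)


def match(query):
    # The query goes through the same analyzer as the indexed text.
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return sorted(hits)


if __name__ == "__main__":
    for query in ("SARA", "sara", "Sara"):
        print(query, match(query))
```

Since the indexed text and the query pass through the same lowercasing step, all three spellings resolve to the term "sara" and find only document 1.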

Upvotes: 1
