kgangadhar

Reputation: 5088

Ignore special characters in stored data when running a search query in Elasticsearch

I have my Elasticsearch data stored in the following format:

{
    "person_name": "Abraham Benjamin deVilliers",
    "name": "Abraham",
    "office": {
        "name": "my_office"
    }
},
{
    "person_name": "Johnny O'Ryan",
    "name": "O'Ryan",
    "office": {
        "name": "Johnny O'Ryan"
    }
},
......

And I have a multi_match query to search on person_name, name and office.name, as follows:

{
  "query": {
    "multi_match" : {
      "query":      "O'Ryan",
      "type":       "best_fields",
      "fields":     [ "person_name", "name", "office.name" ],
      "operator":"and"
    }
  }
}

It works fine, and I get the expected result when the query exactly matches name, person_name or office.name, as below:

{
    "person_name": "Johnny O'Ryan",
    "name": "O'Ryan",
    "office": {
        "name": "Johnny O'Ryan"
    }
}

Now I want the search to return the same response when the user passes ORyan instead of O'Ryan, i.e. ignoring the single quote (') in the stored data.

Is there a way to do this at query time, or do I need to strip the special characters while storing the data in Elasticsearch?

Any help will be appreciated.

Upvotes: 1

Views: 229

Answers (1)

Felipe Plazas

Reputation: 334

What you are looking for is a tokenizer: see Tokenizers in the Elasticsearch reference.

In your case, you can try something like:

GET /_analyze
{
  "tokenizer": "letter", 
  "filter":[],
  "text" : "O'Ryan is good"
}

It will produce the following tokens:

{
  "tokens": [
    {
      "token": "O",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "Ryan",
      "start_offset": 2,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 7,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "good",
      "start_offset": 10,
      "end_offset": 14,
      "type": "word",
      "position": 3
    }
  ]
}
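
To actually use this tokenizer you would wrap it in a custom analyzer and point your fields at it. A minimal sketch, assuming Elasticsearch 7+ mapping syntax; the index, analyzer and field names here are only illustrative. Note that the letter tokenizer splits on the quote rather than removing it, so a search for Ryan would match O'Ryan, but ORyan typed as one word still would not; the char filter in the update below covers that case.

PUT letter_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "letter_analyzer": {
          "tokenizer": "letter",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "person_name": {
        "type": "text",
        "analyzer": "letter_analyzer"
      }
    }
  }
}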

Update:

You could also add a Mapping Char Filter to the analyzer used on the name fields (or whichever field the single quote is a problem in):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "' => "
          ]
        }
      }
    }
  }
}
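
For the char filter to take effect on searches, the fields themselves must use my_analyzer. A minimal sketch of the mapping, assuming Elasticsearch 7+ syntax (older versions expect a document type in the mapping) and the field names from the question:

PUT my_index/_mapping
{
  "properties": {
    "person_name": { "type": "text", "analyzer": "my_analyzer" },
    "name": { "type": "text", "analyzer": "my_analyzer" },
    "office": {
      "properties": {
        "name": { "type": "text", "analyzer": "my_analyzer" }
      }
    }
  }
}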

If you run:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "O'Bryan is a good"
}

You will get:

{
  "tokens": [
    {
      "token": "OBryan",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 8,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 11,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "good",
      "start_offset": 13,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
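
With the fields mapped to my_analyzer (see the mapping sketch above), the original multi_match query should now find the document whether the user types O'Ryan or ORyan, since the quote is stripped at both index and search time:

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "ORyan",
      "type": "best_fields",
      "fields": [ "person_name", "name", "office.name" ],
      "operator": "and"
    }
  }
}

One caveat: my_analyzer as defined above has no token filters, and the standard tokenizer does not lowercase, so matching stays case-sensitive; adding "lowercase" to a "filter" list in the analyzer restores the case-insensitive behaviour of the default standard analyzer.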

Upvotes: 1
