Ramakrishna Reddy

Reputation: 347

Match query with multiple words matching in the field

How do I apply a match query on a field whose value contains multiple words, like "los angeles"? How can I match it in the data structure below?

  "addresses" : [
        {
          "type" : "Home",
          "address" : "Los Angeles,CA,US"
        }
      ] 

Below are my mappings and settings; I created custom analysis settings and filters:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "type_table": [
            "# => ALPHANUM",
            "+ => ALPHANUM",
            "@ => ALPHANUM",
            "% => ALPHANUM",
            "~ => ALPHANUM",
            "^ => ALPHANUM",
            "$ => ALPHANUM",
            "& => ALPHANUM",
            "' => ALPHANUM",
            "\" => ALPHANUM",
            "\/ => ALPHANUM",
            ", => ALPHANUM"
          ],
          "preserve_original": "true",
          "generate_word_parts": false,
          "generate_number_parts": false,
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": false
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_word_delimiter"
          ]
        }
      },
      "normalizer": {
        "keyword_lowercase": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "addresses": {
        "type": "nested",
        "properties": {
          "address": {
            "type": "text"
          },
          "type": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

I tried the query below, but I am not getting any results:

 {
   "from": "0",
   "size": "30",
   "query": {
     "bool": {
       "must": [
         {
           "bool": {
             "should": [
               {
                 "nested": {
                   "path": "addresses",
                   "query": {
                     "match": {
                       "addresses.address": {
                         "query": "Los Angeles",
                         "operator": "and"
                       }
                     }
                   }
                 }
               }
             ]
           }
         }
       ]
     }
   },
   "sort": [
     {
       "_score": {
         "order": "desc"
       }
     }
   ]
 }

Is there any problem with the settings I created?

Upvotes: 0

Views: 241

Answers (1)

Bhavya

Reputation: 16192

You are not getting results when the address has a value like "Los Angeles,CA,US" because you are using the whitespace tokenizer.

The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.

Since you are using the and operator with the match query, the query should only retrieve documents that contain tokens for both los and angeles. But with the whitespace tokenizer, no standalone token for Angeles is generated (the second token is Angeles,CA,US), so no results are returned.

POST /_analyze
{
  "tokenizer": "whitespace",
  "text": "Los Angeles,CA,US"
}

The tokens are:

{
  "tokens": [
    {
      "token": "Los",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "Angeles,CA,US",
      "start_offset": 4,
      "end_offset": 17,
      "type": "word",
      "position": 1
    }
  ]
}

But in the case of "Los Angeles ,CA,US", since there is whitespace after Angeles, the tokens generated are: Los, Angeles, ,CA,US.
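
You can verify this with the same _analyze call, changing only the text (note the extra space before the comma):

POST /_analyze
{
  "tokenizer": "whitespace",
  "text": "Los Angeles ,CA,US"
}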

Adding a working example with index data, mapping, and search result

Index Mapping:

Keep the mapping the same, apart from changing the tokenizer from whitespace to "tokenizer": "standard".
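
That is, only the analyzer section of the settings changes; a sketch, with everything else kept exactly as in the original mapping:

"analyzer": {
  "default": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [
      "lowercase",
      "my_word_delimiter"
    ]
  }
}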

Analyze API

The standard tokenizer provides grammar-based tokenization.

POST /_analyze
{
  "tokenizer": "standard",
  "text": "Los Angeles ,CA,US"
}

The tokens are:

{
  "tokens": [
    {
      "token": "Los",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Angeles",
      "start_offset": 4,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "CA",
      "start_offset": 13,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "US",
      "start_offset": 16,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

Index Data:

{
  "addresses": [
    {
      "type": "Home",
      "address": "Los Angeles,CA,US"
    }
  ]
}
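
For completeness, this document can be indexed like so (the index name 64624353 is taken from the search result below; use your own index name):

PUT /64624353/_doc/1
{
  "addresses": [
    {
      "type": "Home",
      "address": "Los Angeles,CA,US"
    }
  ]
}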

Using the same search query as given in the question:

Search Result:

"hits": [
      {
        "_index": "64624353",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.26706278,
        "_source": {
          "addresses": [
            {
              "type": "Home",
              "address": "Los Angeles,CA,US"
            }
          ]
        }
      }
    ]

NOTE: If you want to keep the whitespace tokenizer, then remove "operator": "and" from the search query; with the default or operator a single matching token (los) is enough, so you will get the required result.
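
The nested part of the query would then look like this sketch, with the operator line dropped:

{
  "nested": {
    "path": "addresses",
    "query": {
      "match": {
        "addresses.address": {
          "query": "Los Angeles"
        }
      }
    }
  }
}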

Update 1:

Try using this updated mapping:

{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "type_table": [
            "# => ALPHANUM",
            "+ => ALPHANUM",
            "@ => ALPHANUM",
            "% => ALPHANUM",
            "~ => ALPHANUM",
            "^ => ALPHANUM",
            "$ => ALPHANUM",
            "& => ALPHANUM",
            "' => ALPHANUM",
            "\" => ALPHANUM",
            "\/ => ALPHANUM"
          ],
          "preserve_original": "true",
          "generate_word_parts": true,
          "generate_number_parts": false,
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": false
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "my_word_delimiter"
          ]
        }
      },
      "normalizer": {
        "keyword_lowercase": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "addresses": {
        "type": "nested",
        "properties": {
          "address": {
            "type": "text"
          },
          "type": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

1. generate_word_parts is set to true so that the filter includes tokens consisting of alphabetical characters in the output.
2. The word_delimiter token filter splits tokens at non-alphanumeric characters. I have removed ", => ALPHANUM" from type_table, so tokens are now also split on commas (see the verification call below).
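
After recreating the index with the updated settings, you can check which tokens the address field now produces. My expectation from the filter options above (not verified output) is: los, the preserved original angeles,ca,us, and the word parts angeles, ca, and us.

POST /test/_analyze
{
  "field": "addresses.address",
  "text": "Los Angeles,CA,US"
}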

Upvotes: 1
