Nihal Sharma

Reputation: 2437

How to ignore trailing white-spaces while making an aggregation query in ElasticSearch

I have an aggregation query that buckets documents by city name and, within each city, by country name. The query (which I run in Sense) is as follows:

GET test/_search
{
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "name.autocomplete" : {
            "query" : "new yo",
            "type" : "boolean"
          }
        }
      },
      "must_not" : {
        "term" : {
          "source" : "old"
        }
      }
    }
  },
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "field" : "cityname.raw",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "field" : "countryname.raw"
          }
        }
      }
    }
  }
}

In the documents, New York occurs twice, once with an extra trailing space. The aggregation result I get is as follows:

{
     "key": "New York",
     "doc_count": 1,
     "city_name": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
           {
              "key": "United States of America",
              "doc_count": 1
           }
        ]
     }
  },
  {
     "key": "New York ",
     "doc_count": 1,
     "city_name": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
           {
              "key": "United States of America",
              "doc_count": 1
           }
        ]
     }
  }

I need both New York values to be treated as the same. Is there a way to write the query so that both end up in the same bucket? Anything that trims the trailing spaces would do, I guess, but I could not find anything. Thanks.

Upvotes: 0

Views: 2563

Answers (1)

Val

Reputation: 217424

The ideal case is to clean up your fields before indexing your documents. If that's not an option, you can still clean them after the fact using (e.g.) the update-by-query plugin...
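As a rough illustration of cleaning fields before indexing, here is a minimal Python sketch. The helper name strip_string_fields and the sample documents are hypothetical, and the field names are assumed from the question; how the cleaned documents are then sent to Elasticsearch is left out.

```python
def strip_string_fields(doc, fields=("cityname", "countryname")):
    """Return a copy of doc with surrounding whitespace removed
    from the given string fields (hypothetical helper)."""
    cleaned = dict(doc)  # shallow copy so the input document is not mutated
    for name in fields:
        value = cleaned.get(name)
        if isinstance(value, str):
            cleaned[name] = value.strip()
    return cleaned


# Sample documents modelled on the question: the second has a trailing space.
docs = [
    {"cityname": "New York", "countryname": "United States of America"},
    {"cityname": "New York ", "countryname": "United States of America"},
]

# Clean every document before handing it to whatever indexing code you use.
cleaned_docs = [strip_string_fields(d) for d in docs]
```

After this step both documents carry the same cityname value, so a plain terms aggregation on the field would put them in one bucket.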

Alternatively, although it performs a bit worse, use a terms aggregation with a script instead of a field, like this:

...
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "script" : "doc['cityname.raw'].value.trim()",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "script" : "doc['countryname.raw'].value.trim()"
          }
        }
      }
    }
  }
}

Yet another solution would be to change the field from a not_analyzed string to an analyzed one, with a custom analyzer that still keeps the whole value as a single token (as not_analyzed does) by combining the keyword tokenizer with a trim token filter.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "trimmer": {
          "type": "custom",
          "filter": [ "trim" ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "cityname": {
          "type": "string",
          "analyzer": "trimmer"
        },
        "countryname": {
          "type": "string",
          "analyzer": "trimmer"
        }
      }
    }
  }
}

If you index cityname: "New York City ", the token that gets stored will be trimmed to "New York City".
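To make the effect of that analyzer concrete, here is a small Python emulation of the two stages. It is only a sketch of the behaviour described above, not Elasticsearch code: the function names are hypothetical, and Python's str.strip() is assumed to be a close enough stand-in for the trim token filter.

```python
def keyword_tokenizer(text):
    # Emulates the keyword tokenizer: the whole input becomes one token.
    return [text]


def trim_filter(tokens):
    # Emulates the trim token filter: strip whitespace around each token.
    return [token.strip() for token in tokens]


def trimmer_analyzer(text):
    # Same order as the custom "trimmer" analyzer: tokenizer first, then filter.
    return trim_filter(keyword_tokenizer(text))


tokens_with_space = trimmer_analyzer("New York City ")
tokens_without_space = trimmer_analyzer("New York City")
```

Both inputs produce the same single token, which is why a terms aggregation on a field analyzed this way would no longer split them into separate buckets.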

Upvotes: 2
