How to ignore trailing white-spaces while making an aggregation query in ElasticSearch

Question

I have an aggregation query that buckets documents by city name and, within each city, by country name. The query (which I run in Sense) is as follows:

GET test/_search
{
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "name.autocomplete" : {
            "query" : "new yo",
            "type" : "boolean"
          }
        }
      },
      "must_not" : {
        "term" : {
          "source" : "old"
        }
      }
    }
  },
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "field" : "cityname.raw",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "field" : "countryname.raw"
          }
        }
      }
    }
  }
}

Now, in the documents, New York occurs twice, once with an extra trailing space. The aggregation result I get is as follows:

{
     "key": "New York",
     "doc_count": 1,
     "city_name": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
           {
              "key": "United States of America",
              "doc_count": 1
           }
        ]
     }
  },
  {
     "key": "New York ",
     "doc_count": 1,
     "city_name": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
           {
              "key": "United States of America",
              "doc_count": 1
           }
        ]
     }
  }

I need both New York values to be treated as the same. Is there any way to query so that I get both of them in the same group? Anything that trims the trailing spaces will do, I guess. I could not find anything, though. Thanks.

Val · Accepted Answer

The ideal case is to clean up your fields before indexing your documents. If that's not an option, you can still clean them after the fact using (e.g.) the update-by-query plugin...
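As a rough illustration of the first option, here is a minimal Python sketch of cleaning a document before it is sent to the index. The helper name strip_strings and the sample document are hypothetical, and the sketch assumes documents are plain dicts; it does not call any Elasticsearch client.

```python
def strip_strings(doc):
    """Return a copy of doc with surrounding whitespace removed from every string value."""
    if isinstance(doc, dict):
        return {key: strip_strings(value) for key, value in doc.items()}
    if isinstance(doc, list):
        return [strip_strings(item) for item in doc]
    if isinstance(doc, str):
        return doc.strip()
    # Numbers, booleans and None are returned unchanged.
    return doc


# Hypothetical document with a trailing space in the city name.
raw_doc = {
    "cityname": "New York ",
    "countryname": "United States of America",
    "source": "new",
}
clean_doc = strip_strings(raw_doc)
```

The cleaned copy is what you would then index, so that not_analyzed fields such as cityname.raw never see the trailing space.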

Alternatively, though it performs a bit worse, use a terms aggregation with a script instead of a field, like this:

...
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "script" : "doc['cityname.raw'].value.trim()",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "script" : "doc['countryname.raw'].value.trim()"
          }
        }
      }
    }
  }
}
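To see why trimming in the script merges the two buckets, here is a small Python simulation of the grouping step. It is only a sketch of the idea, not how Elasticsearch computes aggregations: terms_buckets is a hypothetical helper, and Python's str.strip() stands in for the script's trim().

```python
from collections import Counter


def terms_buckets(docs, field, trim=False):
    """Count documents per distinct value of field, optionally trimming the value first."""
    keys = (doc[field].strip() if trim else doc[field] for doc in docs)
    return dict(Counter(keys))


# Two hypothetical documents that differ only by a trailing space.
docs = [{"cityname": "New York"}, {"cityname": "New York "}]

by_field = terms_buckets(docs, "cityname")              # two buckets, one doc each
by_trimmed = terms_buckets(docs, "cityname", trim=True)  # one bucket with both docs
```

Grouping on the raw value keeps "New York" and "New York " apart, while grouping on the trimmed value counts them together.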

Yet another solution would be to change the field from not_analyzed to an analyzed string, with a custom analyzer that preserves the whole value as a single token (as not_analyzed does) by combining the keyword tokenizer with a trim token filter:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "trimmer": {
          "type": "custom",
          "filter": [ "trim" ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "cityname": {
          "type": "string",
          "analyzer": "trimmer"
        },
        "countryname": {
          "type": "string",
          "analyzer": "trimmer"
        }
      }
    }
  }
}

If you index cityname: "New York City ", the token that gets indexed will be trimmed to "New York City".
