Reputation: 2437
I have an aggregation query that buckets documents by city name, with a sub-aggregation on country name. The query (which I run in Sense) is as below:
GET test/_search
{
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "name.autocomplete" : {
            "query" : "new yo",
            "type" : "boolean"
          }
        }
      },
      "must_not" : {
        "term" : {
          "source" : "old"
        }
      }
    }
  },
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "field" : "cityname.raw",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "field" : "countryname.raw"
          }
        }
      }
    }
  }
}
Now, in the documents, New York occurs twice, once with an extra trailing space. The aggregation result I get is as below:
{
  "key": "New York",
  "doc_count": 1,
  "city_name": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "United States of America",
        "doc_count": 1
      }
    ]
  }
},
{
  "key": "New York ",
  "doc_count": 1,
  "city_name": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "United States of America",
        "doc_count": 1
      }
    ]
  }
}
I need both New York values to be treated the same. Is there any way to query so that I get both of them in the same group? Anything that trims the trailing spaces will do, I guess. I could not find anything, though. Thanks.
Upvotes: 0
Views: 2563
Reputation: 217424
The ideal case is to clean up your fields before indexing your documents. If that's not an option, you can still clean them after the fact using (e.g.) the update-by-query plugin...
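As a minimal sketch of the clean-up-before-indexing option, assuming the documents are built as Python dicts before being sent to Elasticsearch: the helper name clean_doc is hypothetical, and the field names cityname and countryname are taken from the question. It only strips surrounding whitespace and leaves the indexing call itself out.

```python
def clean_doc(doc, fields=("cityname", "countryname")):
    """Return a copy of doc with surrounding whitespace stripped
    from the given string fields (hypothetical helper)."""
    cleaned = dict(doc)  # shallow copy so the caller's dict is untouched
    for field in fields:
        value = cleaned.get(field)
        if isinstance(value, str):
            cleaned[field] = value.strip()
    return cleaned


# Sample documents mirroring the two "New York" variants from the question.
docs = [
    {"cityname": "New York", "countryname": "United States of America"},
    {"cityname": "New York ", "countryname": "United States of America"},
]
cleaned_docs = [clean_doc(d) for d in docs]
```

After this step both documents carry the same cityname value, so a plain terms aggregation on the field puts them in one bucket.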
Alternatively, although it performs a bit worse, use a terms aggregation with a script instead of a field, like this:
...
  "aggregations" : {
    "city_name" : {
      "terms" : {
        "script" : "doc['cityname.raw'].value.trim()",
        "min_doc_count" : 1
      },
      "aggregations" : {
        "country_name" : {
          "terms" : {
            "script" : "doc['countryname.raw'].value.trim()"
          }
        }
      }
    }
  }
}
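To illustrate why the script merges the two buckets, here is a rough Python model of what a terms aggregation over trimmed values computes. This is not Elasticsearch code; the function name trimmed_terms_buckets is made up for the illustration, and it ignores sub-aggregations and shard-level details.

```python
from collections import Counter


def trimmed_terms_buckets(values):
    """Group values by their trimmed form, roughly as the scripted
    terms aggregation does (hypothetical model, not the ES API)."""
    counts = Counter(v.strip() for v in values)
    # Buckets ordered by descending doc_count, like the default terms order.
    return [{"key": key, "doc_count": count} for key, count in counts.most_common()]


# The two city names from the question, one with a trailing space.
buckets = trimmed_terms_buckets(["New York", "New York "])
# buckets == [{"key": "New York", "doc_count": 2}]
```

Because the trimming happens per document at query time, the cost grows with the number of matching documents, which is the performance penalty mentioned above.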
Yet another solution would be to change from a not_analyzed string to an analyzed one, with a custom analyzer that preserves the whole value as a single token (as not_analyzed does) by using the keyword tokenizer together with a trim token filter.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "trimmer": {
          "type": "custom",
          "filter": [ "trim" ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "cityname": {
          "type": "string",
          "analyzer": "trimmer"
        },
        "countryname": {
          "type": "string",
          "analyzer": "trimmer"
        }
      }
    }
  }
}
If you index cityname: "New York City ", the token that is going to be stored will be trimmed to "New York City".
Upvotes: 2