Reputation: 2933
Please consider this scenario:
Define mappings
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
Add data
PUT /my_index/my_type/1
{
  "city": "New York"
}
PUT /my_index/my_type/2
{
  "city": "York"
}
PUT /my_index/my_type/3
{
  "city": "york"
}
Query for facets
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw"
      }
    }
  }
}
Result
{
  ...
  "aggregations": {
    "Cities": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "New York",
          "doc_count": 1
        },
        {
          "key": "York",
          "doc_count": 1
        },
        {
          "key": "york",
          "doc_count": 1
        }
      ]
    }
  }
}
Dilemma
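I would like two things, both shown in the dream result below: "York" and "york" counted together in a single bucket, and that bucket's key returned with its original capitalization ("York").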
Dream result
{
  ...
  "aggregations": {
    "Cities": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "New York",
          "doc_count": 1
        },
        {
          "key": "York",
          "doc_count": 2
        }
      ]
    }
  }
}
Upvotes: 4
Views: 2128
Reputation: 8718
It's going to make your client-side code slightly more complicated, but you could always do something like this.
Set up the index with an additional sub-field that is only lower-cased (not split on white space):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "string",
          "fields": {
            "lowercase": {
              "type": "string",
              "analyzer": "lowercase_analyzer"
            },
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
PUT /my_index/my_type/_bulk
{"index":{"_id":1}}
{"city":"New York"}
{"index":{"_id":2}}
{"city":"York"}
{"index":{"_id":3}}
{"city":"york"}
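To make the effect of the two sub-fields concrete, here is a rough plain-Python emulation of the terms each one would index for the three documents above. It does not call Elasticsearch; the helper names `lowercase_analyzer_terms` and `raw_terms` are made up for illustration, and `str.lower()` is only an approximation of the `lowercase` token filter (good enough for ASCII city names).

```python
def lowercase_analyzer_terms(text):
    """Emulate keyword tokenizer + lowercase filter: one lower-cased token."""
    # The keyword tokenizer keeps the whole input as a single token,
    # so "New York" is NOT split on the space.
    return [text.lower()]


def raw_terms(text):
    """Emulate a not_analyzed field: the value is indexed verbatim."""
    return [text]


docs = {1: "New York", 2: "York", 3: "york"}

# Hypothetical view of what each sub-field holds per document.
indexed = {
    doc_id: {
        "city.lowercase": lowercase_analyzer_terms(city),
        "city.raw": raw_terms(city),
    }
    for doc_id, city in docs.items()
}

for doc_id, fields in indexed.items():
    print(doc_id, fields)
```

Documents 2 and 3 end up with the same `city.lowercase` term but different `city.raw` terms, which is what lets the lower-cased field do the grouping while the raw field supplies a display value.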
Then use a two-level aggregation like this: the outer level groups by the lower-cased term, and the inner level sorts the raw terms in ascending order (so that the upper-case variant comes first) and returns only the top raw term for each lower-case term:
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "city_lowercase": {
      "terms": {
        "field": "city.lowercase"
      },
      "aggs": {
        "city_terms": {
          "terms": {
            "field": "city.raw",
            "order": { "_term": "asc" },
            "size": 1
          }
        }
      }
    }
  }
}
which returns:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "city_lowercase": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "york",
          "doc_count": 2,
          "city_terms": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 1,
            "buckets": [
              {
                "key": "York",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "new york",
          "doc_count": 1,
          "city_terms": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "New York",
                "doc_count": 1
              }
            ]
          }
        }
      ]
    }
  }
}
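The extra client-side work mentioned at the top could look roughly like the sketch below: it takes the outer bucket's `doc_count` and the inner bucket's `key` to rebuild a flat bucket list. This is plain Python over the parsed JSON, not an Elasticsearch client call; the function name `flatten_city_buckets` is hypothetical, and `response` is a trimmed copy of the response above containing only the aggregation fields the function reads.

```python
def flatten_city_buckets(response):
    """Collapse the two-level aggregation into one bucket per lower-cased city."""
    flattened = []
    for outer in response["aggregations"]["city_lowercase"]["buckets"]:
        inner = outer["city_terms"]["buckets"]
        # Display the first raw term; fall back to the lower-cased key
        # if the inner aggregation returned nothing.
        display_key = inner[0]["key"] if inner else outer["key"]
        # The count comes from the OUTER bucket, which covers all case variants.
        flattened.append({"key": display_key, "doc_count": outer["doc_count"]})
    return flattened


# Trimmed copy of the response shown above (aggregations only).
response = {
    "aggregations": {
        "city_lowercase": {
            "buckets": [
                {
                    "key": "york",
                    "doc_count": 2,
                    "city_terms": {"buckets": [{"key": "York", "doc_count": 1}]},
                },
                {
                    "key": "new york",
                    "doc_count": 1,
                    "city_terms": {"buckets": [{"key": "New York", "doc_count": 1}]},
                },
            ]
        }
    }
}

print(flatten_city_buckets(response))
```

The result contains the same two buckets as the dream result in the question ("York" with 2 documents, "New York" with 1), in the order the outer aggregation returned them.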
Here's the code I used (with a few more doc examples):
http://sense.qbox.io/gist/f3781d58fbaadcc1585c30ebb087108d2752dfff
Upvotes: 3