Reputation: 2073
I need some help or an idea for the correct procedure.
I have already indexed a large number of documents. Now I have found out that some documents have almost the same content, for example:
{
  "title": "myDocument",
  "date": "2017-09-18",
  "page": 1
}
{
  "title": "myDocument",
  "date": "2017-09-18",
  "page": 2
}
The title field is mapped as text, date as date, and page as integer. As you can see, the only difference is the page value.
Now I want to run a query and filter out these duplicates. Field collapsing seems like a good way to do it, but with it I can't get the correct result count, and that is important for me.
Another way would be to fetch all results first and then filter them "manually", but then I have a problem with pagination.
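For illustration, the manual approach could be sketched as below. This is only a sketch under assumptions: the hits are already fetched into a list of dicts, duplicates are identified by the (title, date) pair, the function names `dedupe_hits` and `paginate` are hypothetical, and the third sample document is made up. It also shows where the pagination problem comes from: every hit has to be fetched and de-duplicated before a page can be cut out.

```python
def dedupe_hits(hits):
    """Keep the first hit for each (title, date) pair, preserving order."""
    seen = set()
    unique = []
    for hit in hits:
        key = (hit["title"], hit["date"])
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


def paginate(items, page, page_size):
    """Return one 1-based page of an already de-duplicated list."""
    start = (page - 1) * page_size
    return items[start:start + page_size]


# Sample hits: the first two mirror the documents above, the third is invented.
hits = [
    {"title": "myDocument", "date": "2017-09-18", "page": 1},
    {"title": "myDocument", "date": "2017-09-18", "page": 2},
    {"title": "otherDocument", "date": "2017-09-19", "page": 1},
]

unique = dedupe_hits(hits)
total = len(unique)  # count after removing duplicates
second_page = paginate(unique, page=2, page_size=1)
```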
Upvotes: 0
Views: 591
Reputation: 1251
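Try a nested terms aggregation like the one below. It groups the hits by title, then by date, then by page. It assumes title has a keyword sub-field (title.keyword). date and page are not text fields, so they are aggregated on directly rather than through a .keyword sub-field.
GET index/type/_search
{
  "aggs": {
    "count_by_title_date_page": {
      "terms": {
        "field": "title.keyword",
        "size": 100
      },
      "aggs": {
        "date": {
          "terms": {
            "field": "date",
            "size": 100
          },
          "aggs": {
            "page": {
              "terms": {
                "field": "page",
                "size": 100
              }
            }
          }
        }
      }
    }
  }
}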
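Another option, sketched here rather than tested against a cluster, is to keep field collapsing for the hits and add a cardinality aggregation on the same field to count the distinct groups. The sketch only builds the request body as a Python dict. It assumes a title.keyword sub-field exists and that the title alone identifies a duplicate group. The helper name `build_collapsed_search` and the aggregation name `distinct_count` are hypothetical.

```python
import json


def build_collapsed_search(query, field="title.keyword", start=0, size=10):
    """Build a search body that collapses hits on `field` and counts groups.

    Assumption: `field` is a keyword (or numeric) field with doc values,
    which field collapsing requires.
    """
    return {
        "query": query,
        "from": start,  # pagination applies to the collapsed hits
        "size": size,
        "collapse": {"field": field},
        "aggs": {
            # Approximate number of distinct values, i.e. collapsed groups.
            "distinct_count": {"cardinality": {"field": field}}
        },
    }


body = build_collapsed_search({"match": {"title": "myDocument"}})
print(json.dumps(body, indent=2))
```

With this body, hits.total still counts every matching document including the duplicates, so the value to use for pagination would be aggregations.distinct_count.value. The cardinality aggregation is approximate, so the count can be slightly off on large result sets.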
Upvotes: 0