Reputation: 3262
I am using elasticsearch 7.8 and I have the entries in the index like below,
{"_id" : 1,"sourceip":"1.1.1.1", "data" : "this is a sample input", "processedflag" : true}
{"_id" : 2,"sourceip":"1.1.1.1", "data" : "this is a sample input", "processedflag" : false}
{"_id" : 3,"sourceip":"1.1.1.1", "data" : "this is an another input", "processedflag" : false}
{"_id" : 4,"sourceip":"1.1.1.2", "data" : "this is a sample input", "processedflag" : false}
Now for the sourceip : 1.1.1.1, I want to aggregate and find the duplicates of "data",
For example in the above case, I want to get the _id of 1 and 2 entries since the data is matched.
Thanks,
Harry
Upvotes: 1
Views: 315
Reputation: 8840
Looking at your data, I've only considered the first three fields and based on it, created the mapping, documents, query and response.
PUT my_ip_index
{
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"sourceip":{
"type": "ip"
},
"data":{
"type": "keyword" <----- Notice this though
}
}
}
}
POST my_ip_index/_doc/1
{
"id": 1,
"sourceip": "1.1.1.1",
"data": "this is a sample input"
}
POST my_ip_index/_doc/2
{
"id": 2,
"sourceip": "1.1.1.1",
"data": "this is a sample input"
}
POST my_ip_index/_doc/3
{
"id": 3,
"sourceip": "1.1.1.1",
"data": "this is an another input"
}
POST my_ip_index/_doc/4
{
"id": 4,
"sourceip": "1.1.1.2",
"data": "this is a sample input"
}
POST my_ip_index/_doc/5
{
"id": 5,
"sourceip": "1.1.1.2",
"data": "this is a sample another input"
}
Only the first two documents are equal i.e. having same ip
as well as data
POST my_ip_index/_search
{
"size": 0,
"aggs": {
"my_ip_address": {
"terms": {
"field": "sourceip",
"min_doc_count": 2 <---- Note this
},
"aggs": {
"my_data": {
"terms": {
"field": "data",
"min_doc_count": 2 <---- Note this
},
"aggs": {
"my_duplicate_ids":{
"terms": {
"field": "id",
"size": 10
}
}
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "my_data._bucket_count"
},
"script": {
"source": "params.count > 0"
}
}
}
}
}
}
}
Note that I've made use of the below aggregations and notice in particular the structure
sourceip
field
data
field
id
fieldAlso notice how I've made use of bucket_count special path in the bucket script aggregation part.
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_ip_address" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1.1.1.1", <---- IP
"doc_count" : 3,
"my_data" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "this is a sample input", <---- data
"doc_count" : 2,
"my_duplicate_ids" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1", <---- id you are looking for
"doc_count" : 1
},
{
"key" : "2", <---- id you are looking for
"doc_count" : 1
}
]
}
}
]
}
}
]
}
}
}
Hope that helps!
Upvotes: 1