hanetz

Reputation: 323

Elasticsearch aggregations on nested inner hits

I have a large amount of data in Elasticsearch. My documents have a nested field called "records" that contains a list of objects with several fields.

I want to query specific objects from the records list, so I use inner_hits in my query. That doesn't help, though: the aggregation request uses size 0, so no hits are returned.

I also couldn't get an aggregation to run only over the inner hits: the aggregation returns results for all the objects inside records, regardless of the query.

This is the query I am using (each document has first_timestamp and last_timestamp fields, and each object in the records list has a timestamp field):

curl -XPOST 'localhost:9200/_msearch?pretty' -H 'Content-Type: application/json' -d'    
{
    "index":[
        "my_index"
    ],
    "search_type":"count",
    "ignore_unavailable":true
}
{
    "size":0,
    "query":{
        "filtered":{
             "query":{
                 "nested":{
                     "path":"records",
                     "query":{
                         "term":{
                             "records.data.field1":"value1"
                         }
                     },
                     "inner_hits":{}
                 }
             },
             "filter":{
                 "bool":{
                     "must":[
                     {
                         "range":{
                             "first_timestamp":{
                                 "gte":1504548296273,
                                 "lte":1504549196273,
                                 "format":"epoch_millis"
                             }
                         }
                     }
                     ]
                 }
             }
         }
     },
     "aggs":{
         "nested_2":{
             "nested":{
                 "path":"records"
             },
             "aggs":{
                 "2":{
                     "date_histogram":{
                          "field":"records.timestamp",
                          "interval":"1s",
                          "min_doc_count":1,
                          "extended_bounds":{
                              "min":1504548296273,
                              "max":1504549196273
                          }
                     }
                }
           }
      }
   }
}'

Upvotes: 11

Views: 11975

Answers (3)

Musab Dogan

Reputation: 3580

You can also apply the filter inside the nested aggregation, like this:

PUT records
{
  "mappings": {
    "properties": {
      "records": {
        "type": "nested"
      }
    }
  }
}

POST records/_doc
{
  "records": [
    {
      "data": "test1",
      "value": 1
    },
    {
      "data": "test2",
      "value": 2
    }
  ]
}

GET records/_search
{
  "size": 0,
  "aggs": {
    "all_nested_count": {
      "nested": {
        "path": "records"
      },
      "aggs": {
        "bool_aggs": {
          "filter": {
            "bool": {
              "must": [
                {
                  "term": {
                    "records.data": "test2"
                  }    
                }
              ]
            }
          },
          "aggs": {
            "filtered_aggs": {
              "sum": {
                "field": "records.value"
              }
            }
          }
        }
      }
    }
  }
}

Ref: https://www.elastic.co/guide/en/elasticsearch/reference/current/inner-hits.html
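To make it easier to see what the nested -> filter -> sum chain above computes, here is a minimal in-memory sketch in Python. It does not talk to Elasticsearch; the function name `filtered_nested_sum` and the shape of the returned dict are illustrative assumptions that only roughly mirror the aggregation names used in the request.

```python
def filtered_nested_sum(docs, path, match_field, match_value, sum_field):
    """Mimic nested -> filter -> sum over plain Python dicts (illustration only)."""
    # "nested" step: every object under `path` becomes its own candidate.
    nested = [obj for doc in docs for obj in doc.get(path, [])]
    # "filter" step: keep only nested objects whose field equals the value.
    matching = [obj for obj in nested if obj.get(match_field) == match_value]
    # "sum" step: add up the numeric field over the surviving objects.
    total = float(sum(obj[sum_field] for obj in matching))
    return {
        "doc_count": len(nested),
        "bool_aggs": {
            "doc_count": len(matching),
            "filtered_aggs": {"value": total},
        },
    }


# The same document that is indexed in the example above.
docs = [
    {"records": [{"data": "test1", "value": 1}, {"data": "test2", "value": 2}]}
]

result = filtered_nested_sum(docs, "records", "data", "test2", "value")
print(result)
```

Only the "test2" object survives the filter, so the sum covers one of the two nested objects rather than the whole records list.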


Upvotes: 0

Saket Gupta

Reputation: 453

Elasticsearch does not support aggregations on inner_hits. The reason is that inner_hits is a very expensive operation, and aggregating over it would sharply increase the complexity of the operation. Here is the GitHub link to the issue.

If you want to aggregate over inner_hits, you can probably use one of the following approaches:

  1. Write a flexible query that returns only the required hits and aggregate over them. Repeat it as many times as needed to get all the hits, aggregating as you go. This approach may leave you with multiple search queries, which is not advisable.
  2. Handle the aggregation logic in your application layer by writing an aggregation parser and running it on the response from Elasticsearch. This approach is a little better, but you carry the overhead of maintaining the parser as your needs change.

Personally, I would recommend changing how your data is mapped in Elasticsearch so that you can run aggregations on it directly.
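As a rough illustration of option 2, the sketch below buckets inner hits by second on the client side. It assumes the search response nests inner hits under `hits.hits[].inner_hits.<name>.hits.hits[]._source`, that the inner hits are named `records` (the nested path), and that each nested object carries a numeric `timestamp` in epoch milliseconds. The function name and the sample response are made up for the example. Note that the request would need a non-zero `size`, and that inner_hits returns only a few hits per parent unless its own `size` is raised.

```python
from collections import Counter


def histogram_from_inner_hits(response, inner_hits_name="records",
                              ts_field="timestamp", interval_ms=1000):
    """Count inner hits per time bucket from an already-fetched response."""
    buckets = Counter()
    for hit in response.get("hits", {}).get("hits", []):
        inner = hit.get("inner_hits", {}).get(inner_hits_name, {})
        for nested_hit in inner.get("hits", {}).get("hits", []):
            ts = nested_hit.get("_source", {}).get(ts_field)
            if ts is None:
                continue  # skip nested objects without a timestamp
            # Round down to the start of the bucket, like a date histogram key.
            buckets[ts - ts % interval_ms] += 1
    return dict(sorted(buckets.items()))


# Hand-written sample response (hypothetical data, not real output).
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "inner_hits": {
                    "records": {
                        "hits": {
                            "hits": [
                                {"_source": {"timestamp": 1504548296273}},
                                {"_source": {"timestamp": 1504548296900}},
                                {"_source": {"timestamp": 1504548297100}},
                            ]
                        }
                    }
                },
            }
        ]
    }
}

histogram = histogram_from_inner_hits(sample_response)
print(histogram)
```

This keeps the bucketing logic in one small function, but every matching inner hit has to be transferred to the client first, which is the maintenance and cost overhead mentioned above.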

Upvotes: 3

Eli

Reputation: 4926

Your query is pretty complex. In short, here is the query you asked for:

{
  "size": 0,
  "aggregations": {
    "nested_A": {
      "nested": {
        "path": "records"
      },
      "aggregations": {
        "bool_aggregation_A": {
          "filter": {
            "bool": {
              "must": [
                {
                  "term": {
                    "records.data.field1": "value1"
                  }    
                }
              ]
            }
          },
          "aggregations": {
            "reverse_aggregation": {
              "reverse_nested": {},
              "aggregations": {
                "bool_aggregation_B": {
                  "filter": {
                    "bool": {
                      "must": [
                        {
                          "range": {
                            "first_timestamp": {
                              "gte": 1504548296273,
                              "lte": 1504549196273,
                              "format": "epoch_millis"
                            }
                          }
                        }
                      ]
                    }
                  },
                  "aggregations": {
                    "nested_B": {
                      "nested": {
                        "path": "records"
                      },
                      "aggregations": {
                        "my_histogram": {
                          "date_histogram": {
                            "field": "records.timestamp",
                            "interval": "1s",
                            "min_doc_count": 1,
                            "extended_bounds": {
                              "min": 1504548296273,
                              "max": 1504549196273
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Now, let me explain each step, by aggregation name:

  • size: 0 -> we are not interested in hits, only in aggregations
  • nested_A -> data.field1 is under records, so we move our scope into records
  • bool_aggregation_A -> filter by data.field1: value1
  • reverse_aggregation -> first_timestamp is not in the nested documents, so we need to scope back out of records
  • bool_aggregation_B -> filter by the first_timestamp range
  • nested_B -> now we scope into records again for the timestamp field (located under records)
  • my_histogram -> finally, build the date histogram on the timestamp field
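The steps above can also be composed programmatically, which may make the deep nesting easier to read and to parametrize. The following Python sketch only builds the request body as a dict; the helper names (`nested`, `filtered`, `reverse_nested`, `build_body`) are hypothetical and not part of any Elasticsearch client.

```python
import json


def nested(path, sub_aggs):
    """Scope into the nested documents under `path`."""
    return {"nested": {"path": path}, "aggregations": sub_aggs}


def filtered(clause, sub_aggs):
    """Keep only documents in the current scope that match `clause`."""
    return {"filter": {"bool": {"must": [clause]}}, "aggregations": sub_aggs}


def reverse_nested(sub_aggs):
    """Scope back out to the parent documents."""
    return {"reverse_nested": {}, "aggregations": sub_aggs}


def build_body(path, term_field, term_value, range_field, gte, lte, ts_field,
               interval="1s"):
    term_clause = {"term": {term_field: term_value}}
    range_clause = {
        "range": {range_field: {"gte": gte, "lte": lte, "format": "epoch_millis"}}
    }
    histogram = {
        "date_histogram": {
            "field": ts_field,
            "interval": interval,
            "min_doc_count": 1,
            "extended_bounds": {"min": gte, "max": lte},
        }
    }
    # Same order as the explanation: in, filter, out, filter, in, histogram.
    return {
        "size": 0,
        "aggregations": {
            "nested_A": nested(path, {
                "bool_aggregation_A": filtered(term_clause, {
                    "reverse_aggregation": reverse_nested({
                        "bool_aggregation_B": filtered(range_clause, {
                            "nested_B": nested(path, {"my_histogram": histogram}),
                        }),
                    }),
                }),
            }),
        },
    }


body = build_body("records", "records.data.field1", "value1",
                  "first_timestamp", 1504548296273, 1504549196273,
                  "records.timestamp")
print(json.dumps(body, indent=2))
```

The printed JSON should match the request body shown above and can be sent as the body of a search request.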

Upvotes: 22
