Reputation: 83
The documents that I deal with in Elasticsearch have the concept of a duration, represented as a start and end time, e.g.:
{
  issueId: 1,
  issuePriority: 3,
  timeWindow: {
    start: "2015-10-14T17:00:00-07:00",
    end: "2015-10-14T18:00:00-07:00"
  }
},
{
  issueId: 2,
  issuePriority: 1,
  timeWindow: {
    start: "2015-10-14T16:50:00-07:00",
    end: "2015-10-14T17:50:00-07:00"
  }
}
My goal is to produce a histogram where the number of issues and their max priority are aggregated into 15-minute buckets. So for the example above, issue #1 will be bucketized into the 17:00, 17:15, 17:30, and 17:45 buckets, no more, no less.
I tried using the date_histogram aggregation, e.g.:
aggs: {
  max_priority_over_time: {
    date_histogram: {
      field: "timeWindow.start",
      interval: "15m"
    },
    aggs: {
      max_priority: ${top_hits_aggregation}
    }
  }
}
but obviously it only bucketizes issue #1 into the 17:00 bucket. Even if I were to take timeWindow.end into account, it would only be added to the 18:00 bucket. Does anyone know how I can accomplish this using the date_histogram or other Elasticsearch aggregations? Perhaps by generating a range of timestamps 15 minutes apart, from timeWindow.start to timeWindow.end, so that they can be bucketized correctly. Thanks.
Upvotes: 3
Views: 4258
Reputation: 19283
You will need to use a script for this. Create a script that emits an array of dates, starting from the start date and incrementing by 15 minutes each step (assuming 15 minutes is the interval). Then place this script in the script option of the date_histogram. So essentially the script should do the following:
start = 2015-10-14T17:00:00-07:00
end   = 2015-10-14T18:00:00-07:00

Output of script = [ "2015-10-14T17:00:00-07:00", "2015-10-14T17:15:00-07:00", "2015-10-14T17:30:00-07:00", "2015-10-14T17:45:00-07:00", "2015-10-14T18:00:00-07:00" ]
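For illustration, a rough sketch of what the aggregation could look like (Groovy was the default scripting language at the time; the field names are taken from the question, and this assumes the histogram accepts a script emitting multiple values per document, as described above):

aggs: {
  max_priority_over_time: {
    date_histogram: {
      // emits one timestamp (epoch millis) per 15-minute step,
      // from timeWindow.start up to and including timeWindow.end
      script: "def step = 15 * 60 * 1000;
               def dates = [];
               for (def t = doc['timeWindow.start'].value; t <= doc['timeWindow.end'].value; t += step) {
                 dates.add(t)
               };
               return dates",
      interval: "15m"
    },
    aggs: {
      max_priority: ${top_hits_aggregation}
    }
  }
}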
To learn more about scripting, you can go through the Elasticsearch documentation; there are also a number of blog posts on the topic.
Upvotes: 1
Reputation: 83
OK, since the timestamps for my data are always truncated to the nearest 10 minutes, I figured I could use a nested terms aggregation instead:
aggs: {
  per_start_time: {
    terms: {
      field: "timeWindow.start"
    },
    aggs: {
      per_end_time: {
        terms: {
          field: "timeWindow.end"
        },
        aggs: {
          max_priority: ${top_hits_aggregation}
        }
      }
    }
  }
}
This gives me a nested bucket per start_time per end_time, e.g.:
{
  "key": 1444867800000,
  "key_as_string": "2015-10-15T00:10:00.000Z",
  "doc_count": 11,
  "per_end_time": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": 1444871400000,
        "key_as_string": "2015-10-15T01:10:00.000Z",
        "doc_count": 11,
        "max_priority": {
          "hits": {
            "total": 11,
            "max_score": 4
          }
        }
      }
    ]
  }
}
By trimming down the buckets in our backend (Ruby on Rails), I could get the following results:
[
  {
    "start_time": "2015-10-14 14:40:00 -0700",
    "end_time": "2015-10-14 15:40:00 -0700",
    "max_priority": 4,
    "count": 12
  }
],
[
  {
    "start_time": "2015-10-14 14:50:00 -0700",
    "end_time": "2015-10-14 15:50:00 -0700",
    "max_priority": 4,
    "count": 12
  }
],
...
These can be map/reduced further into a date histogram with arbitrary time buckets, outside of Elasticsearch of course. If timeWindow.start, timeWindow.end, and the window duration were completely arbitrary in time, I guess this would be equivalent to just fetching everything and doing the counting in the backend, since it would generate almost one nested time bucket per document. Fortunately the timestamps that I deal with are somewhat predictable, so I can take this hybrid approach.
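For what it's worth, the trimming step in our Rails backend looks roughly like this (a sketch only; response is assumed to be the parsed aggregation response shown above):

# Flatten the nested per_start_time / per_end_time buckets into
# flat rows; bucket and field names match the aggregation above.
rows = response["aggregations"]["per_start_time"]["buckets"].flat_map do |start_bucket|
  start_bucket["per_end_time"]["buckets"].map do |end_bucket|
    {
      "start_time"   => start_bucket["key_as_string"],
      "end_time"     => end_bucket["key_as_string"],
      "max_priority" => end_bucket["max_priority"]["hits"]["max_score"],
      "count"        => end_bucket["doc_count"]
    }
  end
end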
Upvotes: 1
Reputation: 31494
By definition, a bucketing operation will put each object returned by your query into one bucket and one bucket only, i.e., you can't have it put the same object into multiple buckets in a single query.
If I understand your problem correctly, you need to run a series of queries, applying a range filter to get the number of issues in each 15-minute interval. So for each interval you define, you would get the issues that are open within that interval:
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "timeWindow.start": {
                  "lte": "2015-10-14T17:00:00-07:00"
                }
              }
            },
            {
              "range": {
                "timeWindow.end": {
                  "gte": "2015-10-14T17:15:00-07:00"
                }
              }
            }
          ]
        }
      }
    }
  }
}
(You would need to add your max_priority aggregation to the query.)
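For example (an assumption on my part, since the question only shows a top_hits placeholder), a simple max aggregation on issuePriority could be appended alongside the query:

"aggs": {
  "max_priority": {
    "max": { "field": "issuePriority" }
  }
}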
The range queries will be cached by Elasticsearch, so this should be fairly efficient. Assuming your historic data does not change, you would also be able to cache the results for historic intervals in your application.
Upvotes: 1