We have groups of logs in Elasticsearch; each group contains 1-7 logs that share a unique ID (transactionId), and every log within a group has a unique timestamp (eventTimestamp).
For example:
{
"transactionId": "id111",
"eventTimestamp": "1505864112047",
"otherfieldA": "fieldAvalue",
"otherfieldB": "fieldBvalue"
}
{
"transactionId": "id111",
"eventTimestamp": "1505864112051",
"otherfieldA": "fieldAvalue",
"otherfieldB": "fieldBvalue"
}
{
"transactionId": "id222",
"eventTimestamp": "1505863719467",
"otherfieldA": "fieldAvalue",
"otherfieldB": "fieldBvalue"
}
{
"transactionId": "id222",
"eventTimestamp": "1505863719478",
"otherfieldA": "fieldAvalue",
"otherfieldB": "fieldBvalue"
}
I need to write a query that, for a given date range, returns the log with the latest timestamp for each transactionId.
Continuing with my simple example, the query should return these logs:
{
"transactionId": "id111",
"eventTimestamp": "1505864112051",
"otherfieldA": "fieldAvalue",
"otherfieldB": "fieldBvalue"
}
{
"transactionId": "id222",
"eventTimestamp": "1505863719478",
"otherfieldA": "fieldAvalue",
"otherfieldB": "fieldBvalue"
}
Any ideas on how to build a query that accomplishes this?
You can get the desired result not with a query alone but with a combination of a terms aggregation and a nested top_hits aggregation.
The terms aggregation builds buckets in which all documents sharing the same term end up together; this is what generates your groups per transactionId. The top_hits aggregation is a metric aggregation that can be configured to return the top x hits of a bucket according to a given sort order, which lets you retrieve the log event with the largest timestamp in each bucket.
Assuming the default dynamic mapping for your sample data (where each string field thekey is indexed both as analyzed text under thekey and as a non-analyzed keyword under thekey.keyword), this query:
GET so-logs/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"eventTimestamp.keyword": {
"gte": 1500000000000,
"lte": 1507000000000
}
}
}
]
}
},
"aggs": {
"by_transaction_id": {
"terms": {
"field": "transactionId.keyword",
"size": 10
},
"aggs": {
"latest": {
"top_hits": {
"size": 1,
"sort": [
{
"eventTimestamp.keyword": {
"order": "desc"
}
}
]
}
}
}
}
}
}
will produce the following output:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"by_transaction_id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "id111",
"doc_count": 2,
"latest": {
"hits": {
"total": 2,
"max_score": null,
"hits": [
{
"_index": "so-logs",
"_type": "entry",
"_id": "AV6z9Yj4QYbhNp_FoXa1",
"_score": null,
"_source": {
"transactionId": "id111",
"eventTimestamp": "1505864112051",
"otherfieldA": "fieldAvalue",
"otherfieldB": "fieldBvalue"
},
"sort": [
"1505864112051"
]
}
]
}
}
},
{
"key": "id222",
"doc_count": 2,
"latest": {
"hits": {
"total": 2,
"max_score": null,
"hits": [
{
"_index": "so-logs",
"_type": "entry",
"_id": "AV6z9ZlOQYbhNp_FoXa4",
"_score": null,
"_source": {
"transactionId": "id222",
"eventTimestamp": "1505863719478",
"otherfieldA": "fieldAvalue",
"otherfieldB": "fieldBvalue"
},
"sort": [
"1505863719478"
]
}
]
}
}
}
]
}
}
}
where you can find the desired results under by_transaction_id.latest in the aggregation results, following the aggregation names defined in the query.
Please be aware that the terms aggregation limits how many buckets it returns, and raising that limit to, say, more than 10,000 is probably not a clever idea from a performance perspective; for details, see the section on the size parameter of the terms aggregation. If you have to deal with a huge number of distinct transaction IDs, I would suggest redundantly storing the "top" entry per transaction ID, e.g. in a separate index that is updated at write time.
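If you are on a newer Elasticsearch release (the composite aggregation was introduced in 6.1), an alternative is to page through all buckets instead of raising the terms size. A minimal sketch, assuming the same so-logs index and string mapping as above (the txid source name is arbitrary):
GET so-logs/_search
{
  "size": 0,
  "query": {
    "range": {
      "eventTimestamp.keyword": {
        "gte": 1500000000000,
        "lte": 1507000000000
      }
    }
  },
  "aggs": {
    "by_transaction_id": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "txid": {
              "terms": {
                "field": "transactionId.keyword"
              }
            }
          }
        ]
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "eventTimestamp.keyword": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}
Each response includes the key of the last bucket (exposed as after_key in recent versions); pass it back via the after parameter of the composite aggregation to fetch the next page of buckets.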
In addition, you should probably map the eventTimestamp field as date for better performance and a wider set of query possibilities; the string-based range filter and sort above only behave numerically because all the timestamps happen to have the same number of digits.
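A minimal sketch of such a mapping, assuming the so-logs index and entry type from the example above; the epoch_millis format matches the millisecond timestamps in your sample data:
PUT so-logs
{
  "mappings": {
    "entry": {
      "properties": {
        "eventTimestamp": {
          "type": "date",
          "format": "epoch_millis"
        }
      }
    }
  }
}
With that mapping in place, the range filter and the sort can target eventTimestamp directly (without the .keyword suffix), and Elasticsearch compares the values as dates rather than as strings.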