fedor-sg

Reputation: 229

Search and aggregation on two indices

Two indices are created, each with its own date field.

First index mapping:

PUT /index_one
{
    "mappings": {
        "properties": {
            "date_start": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
        }
    }
}

Second index mapping:

PUT /index_two
{
    "mappings": {
        "properties": {
            "date_end": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
            }
        }
    }
}

I need to find documents whose dates fall within a certain range and run an aggregation that averages the difference between the two dates.

I tried making a request like this:

GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
    "aggs": {
        "filtered_dates": {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "exists": {
                                "field": "date_start"
                            }
                        },
                        {
                            "exists": {
                                "field": "date_end"
                            }
                        }
                    ]
                }
            },
            "aggs": {
                "avg_date": {
                    "avg": {
                        "script": {
                            "lang": "painless",
                            "source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_begin'].value.toInstant().toEpochMilli()"
                        }
                    }
                }
            }
        }
    }
}

I get the following response to the request:

{
    "hits": {
        "total": {
            "value": 16508
        },
        "hits": [
            {
                "_index": "index_one",
                "_type": "_doc",
                "_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
                "_score": 1.0,
                "_source": {
                    "date_begin": "2021-02-26 07:26:29.732+0300"
                }
            }
        ]
    },
    "aggregations": {
        "filtered_dates": {
            "meta": {},
            "doc_count": 0,
            "avg_date": {
                "value": null
            }
        }
    }
}

Can you please tell me whether it is possible to combine search and aggregation across two indices in Elasticsearch, and if so, how?

Upvotes: 0

Views: 1013

Answers (1)

Joe - Check out my books

Reputation: 16895

If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.
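
For illustration, here is a minimal sketch of that single-document approach (the date_ranges index name and the sample document are hypothetical):

PUT /date_ranges
{
  "mappings": {
    "properties": {
      "date_start": { "type": "date" },
      "date_end": { "type": "date" }
    }
  }
}

POST /date_ranges/_doc
{ "date_start": "2021-01-01", "date_end": "2021-12-31" }

GET /date_ranges/_search?filter_path=aggregations
{
  "size": 0,
  "aggs": {
    "avg_duration": {
      "avg": {
        "script": {
          "lang": "painless",
          "source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_start'].value.toInstant().toEpochMilli()"
        }
      }
    }
  }
}

With both dates on the same document, the avg script from the question works as-is.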


Now, the script context operates on a single document at a time and has "no clue" about other, potentially related docs. So if you don't store both dates in at least one doc, you'll need some other way to connect the docs.

One option would be to use their ids:

POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }

POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }

POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }

POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }

After that, it's possible to:

  1. Target multiple indices — as you already do.
  2. Group the docs by their IDs and keep only the buckets that contain at least 2 docs (one doc contributing the start, the other the end).
  3. Obtain the min & max dates — essentially cherry-picking the date_start and date_end to be used later down the line.
  4. Use a bucket_script aggregation to calculate their difference (in milliseconds).
  5. Leverage a top-level average bucket aggregation to run over all the difference buckets and ... average them.

In concrete terms:

GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "field": "id",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "min_date": {
          "min": {
            "field": "date_start"
          }
        },
        "max_date": {
          "max": {
            "field": "date_end"
          }
        },
        "diff": {
          "bucket_script": {
            "buckets_path": {
              "min": "min_date",
              "max": "max_date"
            },
            "script": "params.max - params.min"
          }
        }
      }
    },
    "avg_duration_across_the_board": {
      "avg_bucket": {
        "buckets_path": "grouped_by_id>diff",
        "gap_policy": "skip"
      }
    }
  }
}

If everything goes right, you'll end up with:

...
"aggregations" : {
  "grouped_by_id" : {
    ...
  },
  "avg_duration_across_the_board" : {
    "value" : 1.70208E10        <-- 17,020,800,000 milliseconds ~ 4,728 hrs
  }
}
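
As a sanity check against the two sample pairs above: id 1 spans 2021-01-01 to 2021-12-31 (364 days = 8,736 hrs) and id 2 spans 2021-01-01 to 2021-01-31 (30 days = 720 hrs), so the expected average is (8,736 + 720) / 2 = 4,728 hrs = 17,020,800,000 ms, which matches the value returned.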

⚠️ Caveat: note that the terms aggregation (step 2 above) has an adjustable size. You'll probably need to increase it to cover more docs, but there are theoretical and practical limits to how far it makes sense to increase it.
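
If increasing the size gets unwieldy, one option (a sketch, assuming your id cardinality is high enough to warrant it) is the terms aggregation's partitioning feature, which walks the id space in slices across several requests. The per-partition averages then need to be combined client-side, because the avg_bucket pipeline only sees the buckets of its own request:

GET /index_one,index_two/_search?filter_path=aggregations
{
  "size": 0,
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "field": "id",
        "include": { "partition": 0, "num_partitions": 20 },
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "min_date": { "min": { "field": "date_start" } },
        "max_date": { "max": { "field": "date_end" } },
        "diff": {
          "bucket_script": {
            "buckets_path": { "min": "min_date", "max": "max_date" },
            "script": "params.max - params.min"
          }
        }
      }
    },
    "avg_duration_in_partition": {
      "avg_bucket": {
        "buckets_path": "grouped_by_id>diff",
        "gap_policy": "skip"
      }
    }
  }
}

Repeat with partition values 1 through 19 to cover every id.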


📖 Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook — containing lots of other real-world, non-trivial examples 🙌

Upvotes: 1
