Avner Levy
Avner Levy

Reputation: 6741

Store time related data in ElasticSearch

I want to store time-related data in ElasticSearch. Each document has a start time and an end time which define when it was relevant (an open ticket for example). Then I want to be able to run queries such as:

  1. Show all tickets that were open between January 5th to February 2th.
  2. Show facets on multiple fields for all tickets opened in some time range

Since I'll have monthly indexes, documents spanning more than a month will be stored in more than one index.
For example, a ticket opened between January and April needs to be stored in the indexes of all 4 months' indexes.
I was wondering if there is an easy way to run later on aggregations across the indexes which will know to take each ticket only once into account (for faceting and such).

For example, if I have the following tickets:

  1. Ticket A is open between 1/1/2021-17/1/2021
  2. Ticket B is open between 15/1/2021-16/2/2021
  3. Ticket C is open between 12/1/2021-16/3/2021

We will have the following documents/indexes:

Index January:  
  {id: A, StartDate: 1/1/2021, EndDate: 17/1/2021}
  {id: B, StartDate: 15/1/2021}
  {id: C, StartDate: 12/1/2021}
Index February:  
  {id: B, StartDate: 15/1/2021, EndDate: 16/2/2021}
  {id: C, StartDate: 12/1/2021}
Index March:  
  {id: C, StartDate: 12/1/2021, endDate: 16/3/2021}

Any inputs related to the best way to implement such a thing with ElasticSearch are most welcomed. If there are other preferred databases for the job with high scale and fast aggregations I'll be happy to hear on alternatives as well.

Upvotes: 1

Views: 583

Answers (1)

Joe - Check out my books
Joe - Check out my books

Reputation: 16895

This'll be pretty straightforward if both of these conditions are fulfilled:

  1. Each document has precisely one StartDate value.
  2. Each document that has an EndDate also has a StartDate.

So, if we set up an index template covering all our future monthly indices:

PUT _index_template/monthly-indices
{
  "index_patterns": [
    "index-2021-*"
  ],
  "template": {
    "mappings": {
      "properties": {
        "id": {
          "type": "keyword"
        },
        "StartDate": {
          "type": "date",
          "format": "d/M/yyyy"
        },
        "EndDate": {
          "type": "date",
          "format": "d/M/yyyy"
        }
      }
    }
  }
}

we can then add the sample docs from your question:

POST index-2021-01/_doc/A
{
  "id": "A",
  "StartDate": "1/1/2021",
  "EndDate": "17/1/2021"
}

POST index-2021-01/_doc/B
{
  "id": "B",
  "StartDate": "15/1/2021"
}

POST index-2021-02/_doc/B
{
  "id": "B",
  "StartDate": "15/1/2021",
  "EndDate": "16/2/2021"
}

POST index-2021-02/_doc/C
{
  "id": "C",
  "StartDate": "12/2/2021"
}

POST index-2021-03/_doc/C
{
  "id": "C",
  "StartDate": "12/2/2021",
  "EndDate": "16/3/2021"
}

After that, the question "Which tickets were open between 1/1/2021 and 31/3/2021?" can be then answered through:

POST index-2021-*/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "StartDate": {
              "gte": "1/1/2021"
            }
          }
        },
        {
          "range": {
            "EndDate": {
              "lte": "31/3/2021"
            }
          }
        }
      ]
    }
  }
}

which returns the docs A, B, and C -- but obviously only those entries that have both start & end dates.

Moving forward, a simple "facet" statistic denoting "how many tickets were open month-over-month" can be constructed using this scripted date_histogram aggregation:

POST index-2021-*/_search
{
  "size": 0,
  "aggs": {
    "open_count_by_months": {
      "date_histogram": {
        "interval": "month",
        "format": "MMMM yyyy", 
        "script": """
          def start = doc['StartDate'].value;
          def end = doc['EndDate'].size() == 0 ? null : doc['EndDate'].value;
          
          if (end == null) {
            return start
          }
          
          return end
        """
      }
    }
  }
}

yielding

"aggregations" : {
  "open_count_by_months" : {
    "buckets" : [
      {
        "key_as_string" : "January 2021",
        "key" : 1609459200000,
        "doc_count" : 2
      },
      {
        "key_as_string" : "February 2021",
        "key" : 1612137600000,
        "doc_count" : 2
      },
      {
        "key_as_string" : "March 2021",
        "key" : 1614556800000,
        "doc_count" : 1
      }
    ]
  }
}

Wrapping up, if you manage to keep solid control over the docs' content "spread" across multiple indices, you'll be OK.

If, however, you just let a cronjob sync the tickets every day and construct your docs such that they act like messages "announcing" the current ticket status, i.e.:

{
  "ticket_id": A,
  "status": "OPEN"
  "timestamp": 1613428738570
}

it'll introduce some nasty complexity that I discussed a while back here and here.

Upvotes: 2

Related Questions