tamaroth
tamaroth

Reputation: 37

Select documents between two dates from Elasticsearch

I have an index that contains documents structured as follows:

{
    "year": 2020,
    "month": 10,
    "day": 05,
    "some_other_data": { ... }
}

the ID of each documents is constructed based on the date and some additional data from some_other_data document, like this: _id: "20201005_some_other_unique_data". There is no explicit _timestamp on the documents.

I can easily get the most recent additions by doing the following query:

{
    "query": {
        "match_all": {}
    },
    "sort": [
        {"_uid": "desc"}
    ]
}

Now, the question is: how do I get documents that have essentially a date between day A and day B, where A is, for instance, 2020-07-12 and B is 2020-09-11. You can assume that the input date can be either integers, strings, or anything really as I can manipulate it beforehand.

edit: As requested, I'm including a sample result from the following query:

{
    "size": 4,
    "query": {
        "match": {
            "month": 7
        }
    },
    "sort": [
        {"_uid": "asc"}
    ]
}

The response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1609,
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_andromeda_cryptic",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "Andromeda",
          },
          "parent_yara": {
            "strain": "CrypticMut",
          },
        },
        "sort": [
          "nested#20200703_andromeda_cryptic"
        ]
      },
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_betabot_boaxxe",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "BetaBot",
          },
          "parent_yara": {
            "strain": "Boaxxe",
          },
        },
        "sort": [
          "nested#20200703_betabot_boaxxe"
        ]
      },
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_darkcomet_zorex",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "DarkComet",
          },
          "parent_yara": {
            "strain": "Zorex",
          },
        },
        "sort": [
          "nested#20200703_darkcomet_zorex"
        ]
      },
      {
        "_index": "my_index",
        "_type": "nested",
        "_id": "20200703_darktrack_fake_template",
        "_score": null,
        "_source": {
          "year": 2020,
          "month": 7,
          "day": 3,
          "yara": {
            "strain": "Darktrack",
          },
          "parent_yara": {
            "strain": "CrypticFakeTempl",
          },
        },
        "sort": [
          "nested#20200703_darktrack_fake_template"
        ]
      }
    ]
  }
}

The above-mentioned query will return all documents that have matched the month. So basically anything that was put there in July of any year. What I want to achieve, if at all possible, is getting all documents inserted after a certain date and before another certain date.

Unfortunately, I cannot migrate the data so that it has a timestamp or otherwise nicely sortable fields. Essentially, I need to figure out a logic that will say: give me all documents inserted after july 1st, and before august 2nd. The problem here, is that there are plenty of edge cases, like how to do it when start date and end date are in different years, different months, and so on.

edit: I have solved it using the painless scripting, as suggested by Briomkez, with small changes to the script itself, as follows:

  getQueryForRange(dateFrom: String, dateTo: String, querySize: Number) {
    let script = `
      DateTimeFormatter formatter = new DateTimeFormatterBuilder().appendPattern("yyyy-MM-dd")
                                      .parseDefaulting(ChronoField.NANO_OF_DAY, 0)
                                      .toFormatter()
                                      .withZone(ZoneId.of("Z"));
      ZonedDateTime l = ZonedDateTime.parse(params.l, formatter);
      ZonedDateTime h = ZonedDateTime.parse(params.h, formatter);
      ZonedDateTime x = ZonedDateTime.of(doc['year'].value.intValue(), doc['month'].value.intValue(), doc['day'].value.intValue(), 0, 0, 0, 0, ZoneId.of('Z'));

      ZonedDateTime first = l.isAfter(h) ? h : l;
      ZonedDateTime last = first.equals(l) ? h : l;
      return (x.isAfter(first) || x.equals(first)) && (x.equals(last) || x.isBefore(last));
    `

    return {
      size: querySize,
      query: {
        bool: {
          filter: {
            script: {
              script: {
                source: script,
                lang: "painless",
                params: {
                  l: dateFrom,
                  h: dateTo,
                },
              },
            },
          },
        },
      },
      sort: [{ _uid: "asc" }],
    }
  }

With these changes, the query works well for my version of Elasticsearch (7.2) and the order of dates in not important.

Upvotes: 1

Views: 1710

Answers (1)

Briomkez
Briomkez

Reputation: 577

I see (at least) two alternatives here. Either use script query or simple bool queries.

A. USE SCRIPT QUERIES

Basically, the idea is to build the a timestamp at query time, by exploiting the datetime in painless.

{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "<INSERT-THE-SCRIPT-HERE>",
            "lang": "painless",
            "params": {
              "l": "2020-07-12",
              "h": "2020-09-11" 
            }
          }
        }
      }
    }
  }
}

The script can be the following one:

// Building a ZonedDateTime from params.l
ZonedDateTime l = ZonedDateTime.parse(params.l,DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from params.h
ZonedDateTime h = ZonedDateTime.parse(params.h,DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from the doc
ZonedDateTime doc_date = ZonedDateTime.of(doc['year'].value, doc['month'].value, doc['day'].value, 0, 0, 0, 0, ZoneId.of('Z'));

return (x.isAfter(l) || x.equals(l)) && (x.equals(h) || x.isBefore(h));

B. ALTERNATIVE: splitting the problem in its building blocks

Let us denote with x the document you are searching and let us denote l and h be our lower date and higher date. Let us denote with x.year, x.month and x.day to access the subfield.

So x is contained in the range (l, h) iff

  1. [Condition-1] l <= x AND
  2. [Condition-2] x <= h

The first condition is met if the disjunction of the following conditions holds:

  1. [Condition-1.1] l.year < x.year
  2. [Condition-1.2] l.year == x.year AND l.month < x.month
  3. [Condition-1.3] l.year == x.year AND l.month == x.month AND l.day <= x.day

Similarly, the second condition can be expressed as the disjunction of the following conditions:

  1. [Condition-2.1] h.year > x.year
  2. [Condition-2.2] h.year == x.year AND h.month > x.month
  3. [Condition-2.3] h.year == x.year AND h.month == x.month AND h.day <= x.day

It remains to express these conditions in Elasticsearch DSL:

B-1. Using script query

Given this idea we can write a simple script query. We should substitute

{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "<INSERT SCRIPT HERE>",
            "lang": "painless",
            "params": {
              "l": {
                "year": 2020,
                "month": 07,
                "day": 01
              },
              "h": {
                "year": 2020,
                "month": 09,
                "day": 01
            }
          }
        }
      }
    }
  }
}

In painless you can express the Condition, considering that:

  • x.year is doc['year'].value, x.month is doc['month'].value, x.day is doc['day'].value
  • h.year is params.h.year, etc.
  • l.year is params.l.year, etc.

B-2. Using boolean query

Now we should transform these conditions into a bool conditions. The pseudo-code is the following:

{
  "query": {
    "bool": {
      // AND of two conditions
      "must": [
       
        {
          // Condition 1
        },
       
        {
          // Condition 2
        }
      ]
    }
  }
}

Each Condition-X block will look like this:

{
  "bool": {
    // OR
    "should": [
     { // Condition-X.1 },
     { // Condition-X.2 },
     { // Condition-X.3 },
    ],
    "minimum_should_match" : 1
  }
}

So, for example, we can express [Condition-2-3] with h = 2020-09-11 we can use this range query:

{
  "bool": {
    "must": [
      {
        "range": {
          "year": {
            "gte": 2020,
            "lte": 2020
          }
        }
      },
      {
        "range": {
          "month": {
            "gte": 9,
            "lte": 9
          }
        }
      },
      {
        "range": {
          "day": {
            "lte": 11
          }
        }
      }
    ]
  }
}

Write the entire query is feasible, but I think it would be very long :)

Upvotes: 2

Related Questions