Reputation: 37
I have an index that contains documents structured as follows:
{
"year": 2020,
"month": 10,
"day": 05,
"some_other_data": { ... }
}
the ID of each documents is constructed based on the date and some additional data from some_other_data
document, like this: _id: "20201005_some_other_unique_data"
. There is no explicit _timestamp
on the documents.
I can easily get the most recent additions by doing the following query:
{
"query": {
"match_all": {}
},
"sort": [
{"_uid": "desc"}
]
}
Now, the question is: how do I get documents that have essentially a date between day A and day B, where A is, for instance, 2020-07-12 and B is 2020-09-11. You can assume that the input date can be either integers, strings, or anything really as I can manipulate it beforehand.
edit: As requested, I'm including a sample result from the following query:
{
"size": 4,
"query": {
"match": {
"month": 7
}
},
"sort": [
{"_uid": "asc"}
]
}
The response:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1609,
"max_score": null,
"hits": [
{
"_index": "my_index",
"_type": "nested",
"_id": "20200703_andromeda_cryptic",
"_score": null,
"_source": {
"year": 2020,
"month": 7,
"day": 3,
"yara": {
"strain": "Andromeda",
},
"parent_yara": {
"strain": "CrypticMut",
},
},
"sort": [
"nested#20200703_andromeda_cryptic"
]
},
{
"_index": "my_index",
"_type": "nested",
"_id": "20200703_betabot_boaxxe",
"_score": null,
"_source": {
"year": 2020,
"month": 7,
"day": 3,
"yara": {
"strain": "BetaBot",
},
"parent_yara": {
"strain": "Boaxxe",
},
},
"sort": [
"nested#20200703_betabot_boaxxe"
]
},
{
"_index": "my_index",
"_type": "nested",
"_id": "20200703_darkcomet_zorex",
"_score": null,
"_source": {
"year": 2020,
"month": 7,
"day": 3,
"yara": {
"strain": "DarkComet",
},
"parent_yara": {
"strain": "Zorex",
},
},
"sort": [
"nested#20200703_darkcomet_zorex"
]
},
{
"_index": "my_index",
"_type": "nested",
"_id": "20200703_darktrack_fake_template",
"_score": null,
"_source": {
"year": 2020,
"month": 7,
"day": 3,
"yara": {
"strain": "Darktrack",
},
"parent_yara": {
"strain": "CrypticFakeTempl",
},
},
"sort": [
"nested#20200703_darktrack_fake_template"
]
}
]
}
}
The above-mentioned query will return all documents that have matched the month
. So basically anything that was put there in July of any year. What I want to achieve, if at all possible, is getting all documents inserted after a certain date and before another certain date.
Unfortunately, I cannot migrate the data so that it has a timestamp or otherwise nicely sortable fields. Essentially, I need to figure out a logic that will say: give me all documents inserted after july 1st, and before august 2nd. The problem here, is that there are plenty of edge cases, like how to do it when start date and end date are in different years, different months, and so on.
edit: I have solved it using the painless
scripting, as suggested by Briomkez, with small changes to the script itself, as follows:
getQueryForRange(dateFrom: String, dateTo: String, querySize: Number) {
let script = `
DateTimeFormatter formatter = new DateTimeFormatterBuilder().appendPattern("yyyy-MM-dd")
.parseDefaulting(ChronoField.NANO_OF_DAY, 0)
.toFormatter()
.withZone(ZoneId.of("Z"));
ZonedDateTime l = ZonedDateTime.parse(params.l, formatter);
ZonedDateTime h = ZonedDateTime.parse(params.h, formatter);
ZonedDateTime x = ZonedDateTime.of(doc['year'].value.intValue(), doc['month'].value.intValue(), doc['day'].value.intValue(), 0, 0, 0, 0, ZoneId.of('Z'));
ZonedDateTime first = l.isAfter(h) ? h : l;
ZonedDateTime last = first.equals(l) ? h : l;
return (x.isAfter(first) || x.equals(first)) && (x.equals(last) || x.isBefore(last));
`
return {
size: querySize,
query: {
bool: {
filter: {
script: {
script: {
source: script,
lang: "painless",
params: {
l: dateFrom,
h: dateTo,
},
},
},
},
},
},
sort: [{ _uid: "asc" }],
}
}
With these changes, the query works well for my version of Elasticsearch (7.2) and the order of dates in not important.
Upvotes: 1
Views: 1710
Reputation: 577
I see (at least) two alternatives here. Either use script query or simple bool queries.
Basically, the idea is to build the a timestamp at query time, by exploiting the datetime in painless.
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": "<INSERT-THE-SCRIPT-HERE>",
"lang": "painless",
"params": {
"l": "2020-07-12",
"h": "2020-09-11"
}
}
}
}
}
}
}
The script can be the following one:
// Building a ZonedDateTime from params.l
ZonedDateTime l = ZonedDateTime.parse(params.l,DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from params.h
ZonedDateTime h = ZonedDateTime.parse(params.h,DateTimeFormatter.ISO_LOCAL_DATE);
// Building a ZonedDateTime from the doc
ZonedDateTime doc_date = ZonedDateTime.of(doc['year'].value, doc['month'].value, doc['day'].value, 0, 0, 0, 0, ZoneId.of('Z'));
return (x.isAfter(l) || x.equals(l)) && (x.equals(h) || x.isBefore(h));
Let us denote with x
the document you are searching and let us denote l
and h
be our lower date and higher date. Let us denote with x.year
, x.month
and x.day
to access the subfield.
So x
is contained in the range (l, h)
iff
l <= x
ANDx <= h
The first condition is met if the disjunction of the following conditions holds:
l.year < x.year
l.year == x.year AND l.month < x.month
l.year == x.year AND l.month == x.month AND l.day <= x.day
Similarly, the second condition can be expressed as the disjunction of the following conditions:
h.year > x.year
h.year == x.year AND h.month > x.month
h.year == x.year AND h.month == x.month AND h.day <= x.day
It remains to express these conditions in Elasticsearch DSL:
B-1. Using script query
Given this idea we can write a simple script query. We should substitute
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": "<INSERT SCRIPT HERE>",
"lang": "painless",
"params": {
"l": {
"year": 2020,
"month": 07,
"day": 01
},
"h": {
"year": 2020,
"month": 09,
"day": 01
}
}
}
}
}
}
}
In painless you can express the Condition, considering that:
x.year
is doc['year'].value
, x.month
is doc['month'].value
, x.day
is doc['day'].value
h.year
is params.h.year
, etc.l.year
is params.l.year
, etc.B-2. Using boolean query
Now we should transform these conditions into a bool conditions. The pseudo-code is the following:
{
"query": {
"bool": {
// AND of two conditions
"must": [
{
// Condition 1
},
{
// Condition 2
}
]
}
}
}
Each Condition-X
block will look like this:
{
"bool": {
// OR
"should": [
{ // Condition-X.1 },
{ // Condition-X.2 },
{ // Condition-X.3 },
],
"minimum_should_match" : 1
}
}
So, for example, we can express [Condition-2-3] with h = 2020-09-11
we can use this range query:
{
"bool": {
"must": [
{
"range": {
"year": {
"gte": 2020,
"lte": 2020
}
}
},
{
"range": {
"month": {
"gte": 9,
"lte": 9
}
}
},
{
"range": {
"day": {
"lte": 11
}
}
}
]
}
}
Write the entire query is feasible, but I think it would be very long :)
Upvotes: 2