Reputation: 81
Rank beginner at ElasticSearch here.
I have a list of customers, and their orders as a nested field. Assuming a document structure like:
[
  {
    "customerId": 123,
    "birthday": "1980-01-01",
    "orders": [
      {
        "orderValue": 1500,
        "orderDate": "2018-12-18T12:18:12Z"
      },
      [...]
    ]
  },
  [...]
]
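For the nested aggregation and queries below to work, orders is mapped as a nested field; roughly something like this (the index name customers and the exact field types are simplified here, and on older Elasticsearch versions the properties sit under a mapping type):
PUT /customers
{
  "mappings": {
    "properties": {
      "customerId": { "type": "keyword" },
      "birthday": { "type": "date" },
      "orders": {
        "type": "nested",
        "properties": {
          "orderValue": { "type": "long" },
          "orderDate": { "type": "date" }
        }
      }
    }
  }
}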
What I'd like to query is the list of customers whose orders add up to a certain amount between two dates, and I'd like to be able to combine that with a range query on, for example, birthday.
I've gotten to the point where I can get the total ordered between two dates per customer using aggregations:
{
  "size": 0,
  "aggs": {
    "foo": {
      "nested": {
        "path": "orders"
      },
      "aggs": {
        "grouped_by_customerId": {
          "terms": {
            "field": "orders.customerId.keyword"
          },
          "aggs": {
            "filtered_by_date": {
              "filter": {
                "range": {
                  "orders.orderDate": {
                    "from": "2018-01-28",
                    "to": null,
                    "include_lower": false,
                    "include_upper": true,
                    "format": "yyyy-MM-dd",
                    "boost": 1
                  }
                }
              },
              "aggs": {
                "sum": {
                  "sum": {
                    "field": "orders.orderValue"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
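If all I needed were the totals themselves, I could probably trim the buckets below a threshold by adding a bucket_selector pipeline aggregation as a sibling of filtered_by_date inside grouped_by_customerId; a rough sketch (the threshold of 120 is just an example, and this only filters aggregation buckets, not the query hits):
"having_min_total": {
  "bucket_selector": {
    "buckets_path": {
      "total": "filtered_by_date>sum"
    },
    "script": "params.total >= 120"
  }
}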
However, I'd like to limit the results I get back in the query part itself, so it combines better with all our other filters.
My first thought was to have a script filter and pass the bounding dates and minimum value in as parameters, but then I'd have to iterate over a doc's nested documents, and that doesn't seem to work.
Is that last idea possible, and if so, how?
Thanks!
Upvotes: 3
Views: 2435
Reputation: 81
Finally solved this myself, using a Function Score query as follows:
{
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "min_score": 1,
            "query": {
              "nested": {
                "path": "orders",
                "ignore_unmapped": false,
                "score_mode": "min",
                "boost": 1,
                "query": {
                  "range": {
                    "orders.orderDate": {
                      "from": "2018-12-10",
                      "to": null,
                      "include_lower": true,
                      "include_upper": true,
                      "format": "yyyy-MM-dd",
                      "boost": 1
                    }
                  }
                }
              }
            },
            "functions": [
              {
                "filter": {
                  "match_all": {}
                },
                "script_score": {
                  "script": {
                    "source": "ArrayList x = params['_source']['orders'];if (x == null) { return 0 }long result = x.stream().filter(order -> { if(params.startDate != null && !ZonedDateTime.parse(order.orderDate).isAfter(ZonedDateTime.parse(params.startDate))) return false; return true}).mapToLong(order->Long.parseLong(order.orderValue)).sum();if(params.operator == 'GT') return result > params.totalOrderValue ? 2 : 0;else if (params.operator == 'GE') return result >= params.totalOrderValue ? 3 : 0;else if (params.operator == 'LE') return result <= params.totalOrderValue ? 4 : 0;else if(params.operator == 'LT') return result < params.totalOrderValue ? 5 : 0;return result == params.totalOrderValue ? 6 : 0",
                    "lang": "painless",
                    "params": {
                      "totalOrderValue": 120,
                      "operator": "GE",
                      "startDate": "2012-12-10T23:00:00.000Z"
                    }
                  }
                }
              }
            ],
            "score_mode": "multiply",
            "max_boost": 3.4028235e+38,
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}
The actual scores this produces (2 through 6) are debugging output to verify which operator branch fired; with min_score set to 1, any non-zero score counts as a match. Accessing _source in the script is quite slow, though.
Without a query inside the function_score it still works, but it takes 20 seconds or so to muddle through 3 million records. With the nested query in place, the script only runs on customers that have at least one order matching the date range.
Since the Painless script still processes the entire list of orders, it has to redo the date math the nested query already did. There's some optimization left to do there, but at least I have a proof of concept.
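To combine this with the birthday range from the question, the function_score clause just sits next to the other clauses in the same bool query; roughly like this (the birthday bounds are only an example):
{
  "query": {
    "bool": {
      "must": [
        { "function_score": { ... } }
      ],
      "filter": [
        {
          "range": {
            "birthday": {
              "gte": "1970-01-01",
              "lte": "1990-12-31"
            }
          }
        }
      ]
    }
  }
}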
I've seen this question before without a satisfactory answer, so hopefully someone finds this useful.
Upvotes: 5