SparkleGoat

Reputation: 513

Giant Elasticsearch query

I have a list of must and must_not items that I currently combine into one giant query, but I want to know if this is the best way to go about the problem.

example of the query:

{"query":{ "bool" : { "must" : {"match" : {"tag":"apple"}}, "must_not": [{ 
 "match": { "city": "new york" }},{ "match": { "name": "pizza" }},...........
]}}}

I have 470 must items and 485 must_not items that act as whitelist/blacklist rules for the data. The analytic is built in Spark and the data is housed in Elasticsearch. Each query I pass to Spark contains one of the must items followed by all 485 must_not items. As you can guess, the query itself is rather large and takes around 2 seconds to return results. I submit this type of query for each of the must items, so 470 queries are passed in total. The application currently takes around 22 minutes to complete.
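A minimal sketch of my current pattern, using PySpark with the elasticsearch-hadoop connector (the node address, index name, and example values are placeholders, and the es-hadoop jar is assumed to be on the classpath):

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("es-per-must").getOrCreate()

    must_tags = ["apple", "banana"]              # 470 entries in reality
    must_not_clauses = [
        {"match": {"city": "new york"}},         # 485 entries in reality
        {"match": {"name": "pizza"}},
    ]

    for tag in must_tags:                        # one round trip per must item
        query = {"query": {"bool": {
            "must": {"match": {"tag": tag}},
            "must_not": must_not_clauses,
        }}}
        df = (spark.read.format("org.elasticsearch.spark.sql")
              .option("es.nodes", "localhost:9200")   # placeholder node
              .option("es.resource", "my-index")      # placeholder index
              .option("es.query", json.dumps(query))
              .load())
        # ... per-tag processing ...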

My question - Is this the best way to tackle this problem, and is this even a good problem for Elasticsearch at all given the gigantic query? I have previously attempted to perform Spark joins with the data after passing a query with just the must_not items, which takes far longer than the 470 individual Elasticsearch queries. I used a broadcast hash join because the must data is smaller than the resultant data frame.
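Roughly, the join attempt looked like this (a sketch with the same placeholder names as above):

    from pyspark.sql import functions as F

    # Small side: the 470 must tags as a one-column DataFrame.
    must_df = spark.createDataFrame([(t,) for t in must_tags], ["tag"])

    # Big side: everything that survives the shared must_not blacklist.
    filtered = (spark.read.format("org.elasticsearch.spark.sql")
                .option("es.nodes", "localhost:9200")
                .option("es.resource", "my-index")
                .option("es.query", json.dumps(
                    {"query": {"bool": {"must_not": must_not_clauses}}}))
                .load())

    # Broadcast hash join: ship the small must list to every executor.
    joined = filtered.join(F.broadcast(must_df), on="tag", how="inner")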

Thank you for the help.

Upvotes: 0

Views: 174

Answers (1)

nkasturi

Reputation: 181

Each invocation of the query involves the following overheads:

  • Query preparation and submission
  • Spark–Elasticsearch handshaking
  • Query execution on Elasticsearch (index scan, etc.)
  • Serialization and deserialization
  • Network transfer

The solutions proposed below improve performance by avoiding some of these overheads, specifically:

  • Query preparation and submission
  • Spark–Elasticsearch handshaking
  • Query execution on Elasticsearch (index scan, etc.)

If you don't need to keep the results of the individual query executions separate (one per must item, 470 in total), the ideal way is to build a single boolean query that ORs the must terms together, as @khachik suggested.
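A sketch of that combined query, built in Python with placeholder lists standing in for your 470 must items and 485 must_not items:

    # One bool query ORing all must tags via should, keeping the shared must_not.
    must_tags = ["apple", "banana"]        # your 470 must items
    must_not_clauses = [                   # your 485 must_not items
        {"match": {"city": "new york"}},
        {"match": {"name": "pizza"}},
    ]

    combined = {"query": {"bool": {
        "should": [{"match": {"tag": t}} for t in must_tags],
        "minimum_should_match": 1,         # at least one must tag has to match
        "must_not": must_not_clauses,
    }}}

This replaces 470 round trips with one, so every per-query overhead listed above is paid only once.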

In the event you need the separation, there are two possible solutions.

Solution 1:

Build the Elasticsearch query so that it applies the must_not criteria in the matching clause and returns scripted fields (plus any fields of your choice). In your case there will be 470 scripted fields, one per must term, each of boolean type: the value is TRUE if the must term is present in the document, FALSE otherwise. Once the result is cached as an RDD, you can run multiple queries against it using whatever filters you like; since the data is cached in memory, those queries should be fast.
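A sketch of what such a query could look like; it assumes an exact-match tag.keyword sub-field and Painless scripting, so adjust to your actual mapping:

    # must_tags / must_not_clauses as in the previous sketch.
    scripted = {
        "query": {"bool": {"must_not": must_not_clauses}},
        "script_fields": {
            "is_" + t: {
                "script": {
                    "lang": "painless",
                    "source": "doc['tag.keyword'].value == params.t",
                    "params": {"t": t},
                }
            }
            for t in must_tags             # 470 boolean fields, one per term
        },
    }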

Solution 2:

First execute the query on the Elasticsearch side with only the must_not criteria. Then develop a function that returns TRUE for a given [TOKEN, TAG_VALUE] pair, append 470 boolean columns to the RDD using that function, and cache the RDD. You can then segregate the data by running queries with simple filters.
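A sketch of this approach in PySpark (DataFrame API instead of raw RDDs; it assumes the tag values are safe to embed in column names, and reuses the placeholder lists from the sketches above):

    from pyspark.sql import functions as F

    # Fetch once: only the shared must_not blacklist runs on Elasticsearch.
    base = (spark.read.format("org.elasticsearch.spark.sql")
            .option("es.nodes", "localhost:9200")
            .option("es.resource", "my-index")
            .option("es.query", json.dumps(
                {"query": {"bool": {"must_not": must_not_clauses}}}))
            .load())

    # Append one boolean column per must tag in a single select.
    bool_cols = [(F.col("tag") == F.lit(t)).alias("is_" + t) for t in must_tags]
    base = base.select("*", *bool_cols)

    base.cache()                               # pay the Elasticsearch cost once
    apples = base.filter(F.col("is_apple"))    # each slice is a cheap in-memory filter

Because the 470 per-tag slices run against cached in-memory data, they avoid all of the per-query overheads listed above apart from the one-time initial fetch.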

Upvotes: 1
