Reputation: 107
I have some performance issues to use the IN
operator in ArangoDB.
Briefly, I have an array of computed (at runtime) ids, coming from a splitting function, and I want to get only the selected elements of the Collection and, after that, collect or filter other data.
Here you can find the AQL Query:
LET toInclude = SPLIT('Collection/1,Collection/2,Collection/3', ',')
FOR result IN Collection
FILTER result._id IN toInclude
COLLECT property = result.property
WITH COUNT INTO count
return {property, count}
The elements in the array toInclude
can be 300000+, and the query can spend more than 10 minutes to do the job.
The split
- function finishes in 3sec, the property
field is indexed, so the problem is in the IN
operator.
What can I do to solve this performances issues?
Thank you very much!
Daniele
Upvotes: 2
Views: 716
Reputation: 9097
I tried the query with 500,000 string entries in toInclude
on a collection of 100,000 documents.
It indeed took very long to complete with 2.7. Execution time was around 4xx seconds. The query spends a lot of time evaluating the FILTER
s IN
operator there. In fact the FILTER
condition will be evaluated for each document found. This will do around 100,000 x 500,000 / 2 comparisons with the data I used.
In 2.8 the same query takes around 2.7 seconds using the same data, so the problem seems to not occur there. There have been lots of optimizer changes in 2.8, and the one that's responsible for the speedup is that the IN
expression will be evaluated directly in the index. The FILTER
will be optimized away there.
So one fix will be to use ArangoDB 2.8 when it's available (currently it's in beta).
Another fix would be to improve the optimizer to detect that the right-hand side of the IN
is const in the query so it can sort the result and can do IN
using a binary search (logarithmic instead of linear complexity). But that's not yet available.
A workaround for 2.7 is to compute the IN
list separately and insert it into the query as an array already. That way the IN
list will be a constant value and the optimizer will be able to pre-sort it so it can use a binary search. However this requires the SPLIT
operation to be performed outside of/before the original query.
update: in 2.8 there is now an additional optimizer rule to pre-sort IN
list values for cases like the above and others. This enables the IN
operator to use a binary search, with logarithmic complexity instead of linear complexity that it had for some cases. This change will be included in 2.8 beta2.
Upvotes: 5