ArangoDB - IN operator very slow

Question

I have some performance issues to use the IN operator in ArangoDB.

Briefly, I have an array of computed (at runtime) ids, coming from a splitting function, and I want to get only the selected elements of the Collection and, after that, collect or filter other data.

Here you can find the AQL Query:

LET toInclude = SPLIT('Collection/1,Collection/2,Collection/3', ',')

FOR result IN Collection
  FILTER result._id IN toInclude  
  COLLECT property = result.property 
  WITH COUNT INTO count
return  {property, count}

The elements in the array toInclude can be 300000+, and the query can spend more than 10 minutes to do the job.

The split - function finishes in 3sec, the property field is indexed, so the problem is in the IN operator.

What can I do to solve this performances issues?

Thank you very much!

Daniele

stj · Accepted Answer

I tried the query with 500,000 string entries in toInclude on a collection of 100,000 documents.

It indeed took very long to complete with 2.7. Execution time was around 4xx seconds. The query spends a lot of time evaluating the FILTERs IN operator there. In fact the FILTER condition will be evaluated for each document found. This will do around 100,000 x 500,000 / 2 comparisons with the data I used.

In 2.8 the same query takes around 2.7 seconds using the same data, so the problem seems to not occur there. There have been lots of optimizer changes in 2.8, and the one that's responsible for the speedup is that the IN expression will be evaluated directly in the index. The FILTER will be optimized away there.

So one fix will be to use ArangoDB 2.8 when it's available (currently it's in beta).

Another fix would be to improve the optimizer to detect that the right-hand side of the IN is const in the query so it can sort the result and can do IN using a binary search (logarithmic instead of linear complexity). But that's not yet available.

A workaround for 2.7 is to compute the IN list separately and insert it into the query as an array already. That way the IN list will be a constant value and the optimizer will be able to pre-sort it so it can use a binary search. However this requires the SPLIT operation to be performed outside of/before the original query.

update: in 2.8 there is now an additional optimizer rule to pre-sort IN list values for cases like the above and others. This enables the IN operator to use a binary search, with logarithmic complexity instead of linear complexity that it had for some cases. This change will be included in 2.8 beta2.

ArangoDB - IN operator very slow

Answers (1)

Related Questions