Mikilinux

Reputation: 11

ArangoDB full collection scan and JOIN

When I try to execute an inner join in ArangoDB with only one filter, like this:

FOR doc1 IN catalogue
  FOR doc2 IN RepoNodes
    FOR doc3 IN RepoEdges
      FOR doc4 IN RepoNodes
        FOR doc5 IN RepoEdges
          FOR doc6 IN RepoNodes
            FOR doc10 IN catalogue
              FOR doc11 IN similarities
                FOR doc12 IN clearance
                  FILTER doc1.trackid == "TRAAAAK128F9318786"
                    AND doc1.trackid == doc2.mongodbsongs
                    AND doc3._from == doc2._id
                    AND doc3._to == doc4._id
                    AND doc5._from == doc4._id
                    AND doc6._id == doc5._to
                    AND doc10.trackid == doc6.mongodbsongs
                    AND doc11._from == CONCAT("Tracks/", doc6.neo4jSong)
                    AND doc6.redisclearance == doc12._key
                  RETURN {doc10, doc11, doc12}

I notice that the indexes are ignored, i.e. the query executes a full collection scan. If I serialize the same query into several smaller queries, it works perfectly, and I don't understand why. If I stop before the last four FOR loops, it is also fine. Where is the problem? Why is a full collection scan needed?

Upvotes: 1

Views: 315

Answers (1)

stj

Reputation: 9097

In the latter query (the one that does not use indexes), the query optimizer applies a rule that moves the FOR loops around in the query. It tries to create all possible permutations of FOR loops that do not change the meaning of the query.

In that query, the optimizer is free to move around any of the 9 FOR loops, because there are no FILTER statements between them that could be problematic. So the optimizer starts creating new execution plans with the FOR loops moved around. This could create 9! (factorial, i.e. 362,880) different execution plans, but by default the optimizer stops at 192 plans in order to avoid that combinatorial explosion. And when that number of plans is reached, the optimizer stops applying some further optimizations in order to put a cap on the overall optimization runtime.

That is the reason why, in the second query, the EnumerateCollectionNodes are not converted into IndexNodes: the optimizer stopped before this step because there were already so many execution plans.

In the first query, the optimizer is not as free to move the FOR loops around, because the FILTER statements restrict some of the movements. Because of that, it will not generate as many execution plans and will not stop optimizing prematurely.

Release 3.0.2 contains a fix for this. A temporary workaround until then is to split the single FILTER statement in the second query into multiple smaller FILTER statements and move each one directly after the FOR loop it depends on.
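Applied to the query from the question, that workaround might look like the following (a sketch using the same collection and attribute names as in the original; the semantics are unchanged, only the FILTER conditions are pushed next to the loops they constrain):

FOR doc1 IN catalogue
  FILTER doc1.trackid == "TRAAAAK128F9318786"
  FOR doc2 IN RepoNodes
    FILTER doc2.mongodbsongs == doc1.trackid
    FOR doc3 IN RepoEdges
      FILTER doc3._from == doc2._id
      FOR doc4 IN RepoNodes
        FILTER doc4._id == doc3._to
        FOR doc5 IN RepoEdges
          FILTER doc5._from == doc4._id
          FOR doc6 IN RepoNodes
            FILTER doc6._id == doc5._to
            FOR doc10 IN catalogue
              FILTER doc10.trackid == doc6.mongodbsongs
              FOR doc11 IN similarities
                FILTER doc11._from == CONCAT("Tracks/", doc6.neo4jSong)
                FOR doc12 IN clearance
                  FILTER doc12._key == doc6.redisclearance
                  RETURN {doc10, doc11, doc12}

With each FILTER placed immediately after its FOR loop, the optimizer has far fewer legal loop permutations to consider, so it should convert the EnumerateCollectionNodes into IndexNodes before hitting the plan limit.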

Upvotes: 2
