specimen
specimen

Reputation: 1765

Scaling graph traversals in ArangoDb

I have a tree-like structure, which is actually a directed acyclic graph. A small version shown below. minimal graph

At any parent, I want to sum some properties of the sub-tree. Today I'm doing this in AQL using TRAVERSAL and COLLECT based on starting node:

for c in traversal(nodes,nodeTree,ch,'inbound',{uniqueness:{vertices:'global'}})
    collect child = ch._id into group

Then I can use aggregation on the group. (With ArangoDB 2.8 I believe this can now be done directly in the collect statement.) The uniqueness option solves the problem of duplicates.

Scaling

When the tree (graph) grows into a considerable size (10-20k nodes) how will this scale? I need it to be fast, as the user will wait for the response (not a long running job).

I'm thinking of caching values in the nodes and having a dirty flag. Then in node 1 could just sum 2 and 3, if they were both clean. The problem is that 5 is included in both 2 and 3's sums.

How can I solve this? Or is it a non-problem -- are traversals really that quick?

So far I've come up with the idea of having each node include a list of it's subtree-duplicates, which in 1's case would mean the info "5 is included twice". This could be used to subtract this from 1's total. But how would I find this information? I have thought about finding all nodes with >1 parent, then traversing upwards (that's quick) and then somehow calculate this info.

Upvotes: 1

Views: 575

Answers (1)

mchacki
mchacki

Reputation: 3267

The runtime of the traversal is bounded by the amount of vertices and edges actually touched in the process. So the runtime of the traversal depends on the depth of the paths and the branching factor (how many of those vertices with multiple parents are expected).

The problem with the construct you describe is that the traversal will pick one path from 1 to 5 (say the left one) and sums all the values and eventually returns to 1 to pick the right path. Now it reaches 5 again but this time the search depth is lower than the last time 5 was seen, so it has to actually traverse the subtree on 5 once again because it could now possibly get a larger distance in this path (it does not know that all vertices on this subtree could be reached in shorter distance already). The vertices on this path will not invoke the visitor again but are still traversed and followed which costs time.

I tried to optimize the traversal to validate scaling. First i registered a new optimized visitor:

require("@arangodb/aql/functions").register("test::counter", "function (config, result, vertex) {result[0] = result[0] || {value: 0}; result[0].value += vertex.value}");

This visitor sums the values of the vertices in place and directly returns them, so i can get rid of the COLLECT statement. And i can use it my AQL:

FOR x IN TRAVERSAL(TestVertices, TestEdges, 'TestVertices/0', 'outbound', {uniqueness:{vertices:'global'}, visitor: 'test::counter', maxDepth: 5012})
  RETURN x.value

Note here: I have given a maxDepth in the options to actually search in high depth, the default is 256.

My test-tree was basically a chain of 20.000 vertices where every third vertex has an additional edge to a random vertex later in the chain (simulating your described multiple parents problem)

With this traversal i managed to search a depth of 5012 from the root in ~5 secs. Using a higher depth it growth exponentially.

I assume that your graph has fewer of these multiple parents, so i expect less runtime on your graph.

If you expect a lot more reads then writes you could also consider to compute the sums on each write. This will slow down the writes but will make all reads instant.

As an example you could use the following AQL in addition when updating a value:

LET i = (FOR x IN 1..5012 INBOUND @start TestEdges
           RETURN DISTINCT x) 
  FOR x IN i UPDATE x WITH {sum: x.sum + @add} IN TestVertices

With the bind parameters @add for the value to add and @start for the vertex that is updated. Using this technique your read query is trivial:

FOR x IN TestVertices FILTER x._id == @start 
  RETURN x.sum

Hope this helps.

Upvotes: 2

Related Questions