Reputation: 683
I've created an edge collection with about 16 million edges. The edges are not unique, meaning there can be more than one edge from vertex a to vertex b. The edge collection's data size is about 2.4 GB, and its edge index size is 1.6 GB. I am using a computer with 16 GB RAM (and, additionally, 16 GB swap space).
Now I am trying to calculate the unique edges (between each pair of vertices a-b) with a statement like this one:
FOR wf IN DeWritesWith
  COLLECT from = wf._from, to = wf._to WITH COUNT INTO res
  INSERT { "_from": from, "_to": to, "type": "writesWith", "numArticles": res } INTO DeWritesWithAggregated
  // This also leads to an out-of-memory error:
  // RETURN { "_from": from, "_to": to, "type": "writesWith", "numArticles": res }
My problem: I always run out of memory (exhausting the 16 GB RAM plus the 16 GB swap, i.e. 32 GB in total). As the problem also occurs when I do not write the result, I assume it is not caused by huge write transaction logs. Is this normal, and can I optimize the AQL somehow? I am hoping for a solution, as I think this is a fairly generic usage scenario for graphs...
Upvotes: 1
Views: 108
Reputation: 9097
Since ArangoDB 2.6, COLLECT can run in two modes: a sorted mode, which requires its input to be sorted by the group criteria, and a hash table mode, which does not. The optimizer will choose the hash table mode automatically if it is considered to be cheaper than the sorted mode with its up-front sort step.
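You can check which mode the optimizer picked by explaining the query, for example from arangosh (a sketch; the exact shape of the plan output differs between versions):

db._explain(
  "FOR wf IN DeWritesWith " +
  "COLLECT from = wf._from, to = wf._to WITH COUNT INTO res " +
  "RETURN { from: from, to: to, numArticles: res }"
);
// the printed execution plan contains a CollectNode; depending on the
// version, it is annotated with the chosen method (sorted or hash)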
The new COLLECT implementation in 2.6 should make the selection part of the query run much faster than in 2.5 and before. Note that COLLECT still produces a sorted output of its result (not of its input), even in hash table mode. This is done for compatibility with the sorted mode. This result sort step can be avoided by adding an extra SORT null instruction after the COLLECT statement; the optimizer can then optimize away the sorting of the result.
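Applied to the query from the question, that would look like the following (a sketch using the collection names from the question; note that SORT null only removes the result sort step, it does not reduce the memory needed for the hash table itself):

FOR wf IN DeWritesWith
  COLLECT from = wf._from, to = wf._to WITH COUNT INTO res
  SORT null // tells the optimizer no sorted result is needed, so it can drop the sort
  INSERT { "_from": from, "_to": to, "type": "writesWith", "numArticles": res } INTO DeWritesWithAggregated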
A blog post that explains the two modes is here: http://jsteemann.github.io/blog/2015/04/22/collecting-with-a-hash-table/
Upvotes: 1