augustin-s

Reputation: 683

How to make edges unique and count them without an out-of-memory error

I've created an edge collection with about 16 million edges. The edges are not unique, meaning there can be more than one edge from vertex a to vertex b. The edge collection holds about 2.4 GB of data and has an edge index of 1.6 GB. I am using a computer with 16 GB RAM (plus 16 GB of swap space).

Now I am trying to compute the unique edges (one per pair of vertices a-b) with a statement like this one:

FOR wf IN DeWritesWith
    COLLECT from = wf._from, to = wf._to WITH COUNT INTO res
    INSERT { "_from": from, "_to": to, "type": "writesWith", "numArticles": res } INTO DeWritesWithAggregated
    // Returning the result instead of inserting it also leads to an out-of-memory error:
    // RETURN { "_from": from, "_to": to, "type": "writesWith", "numArticles": res }

My problem: I always run out of memory (32 GB including swap). As the problem also occurs when I do not write the result, I assume it is not caused by huge write transaction logs. Is this normal, and can I optimize the AQL somehow? I am hoping for a solution, as I think this is a fairly common usage scenario for graphs.

Upvotes: 1

Views: 108

Answers (1)

stj

Reputation: 9097

Since ArangoDB 2.6, COLLECT can run in two modes:

  • the sorted mode that uses a sort step before aggregation
  • a hash table mode that does not require an upfront sort step

The optimizer will choose the hash table mode automatically if it considers it cheaper than the sorted mode with its sort step.

The new COLLECT implementation should make the selection part of the query run much faster in 2.6 than in 2.5 and earlier. Note that COLLECT still produces a sorted output of its result (not its input) even in hash table mode. This is done for compatibility with the sorted mode. This result sort step can be avoided by adding an extra SORT null instruction after the COLLECT statement. The optimizer can then optimize away the sorting of the result.
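As a rough sketch, the query from the question with the extra SORT null added would look like the following. It reuses the collection and attribute names from the question and is untested against that data set; whether it is enough to avoid the out-of-memory error on 16 million edges is an assumption, not a measured result.

```aql
FOR wf IN DeWritesWith
    // Group by the (_from, _to) pair and count the duplicates per pair
    COLLECT from = wf._from, to = wf._to WITH COUNT INTO res
    // Tell the optimizer that the order of the COLLECT result does not matter,
    // so it can drop the sort of the result
    SORT null
    INSERT { "_from": from, "_to": to, "type": "writesWith", "numArticles": res } INTO DeWritesWithAggregated
```

Explaining the query before running it (for example with db._explain(...) in arangosh) should indicate which COLLECT mode the optimizer picked.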

A blog post that explains the two modes is here: http://jsteemann.github.io/blog/2015/04/22/collecting-with-a-hash-table/

Upvotes: 1
