Bezruc
Bezruc

Reputation: 1

Is there a way of optimising following query for better edge case respose times?

Problem specification:

I have been working on pretty large dataset of friends_with graph (3.5m users, 234m friend_with edges). User collection consist of: _key,_id,name

friend_with edge collection: _key,_id,_from,_to (collection is denormalized, eg. 1->2 means also 2->1)

I'm running arangoDB community on VM (32vcores, 32GB RAM, 400GB ssd).

I need to effectively return list of friends of friends, excluding immediate friends, ordered by most common users in descending order. (In future, I'd like to also include groups and schools into the graph, and query which can add around 50m more edges and more complexity to the query).

Current situation:

So far I have come up with following query, which can return results almost instantly in low complexity cases, but gets into problems on complex branching:

LET userId = 'users/354537'

LET directFriends = (
  FOR v IN 1..1 outbound userId graph friends_only_recommendation
  RETURN v._id
)

LET friendsOfFriends = (
  FOR v, p IN 2..2 outbound userId graph friends_only_recommendation
  COLLECT friendOfFriend = v._id INTO paths = p
  RETURN {
    friendOfFriend: friendOfFriend,
    pathCount: LENGTH(paths)
  }
)

FOR friend in friendsOfFriends
FILTER friend.friendOfFriend NOT IN directFriends
FILTER friend.friendOfFriend != userId
SORT friend.pathCount DESC
RETURN friend
Results over 100 random users (pure query time in seconds):
total friends only:         18.279518464463763
avg friends only:           0.18279518464463762
max friends:                8.115287372027524
min friends only:           5.98470214754343e-05

Profiling result of the peak user:

Execution plan:
 Id   NodeType            Calls   Par    Items   Filtered   Runtime [s]   Comment
  1   SingletonNode           1     -        1          0       0.00000   * ROOT
 27   SubqueryStartNode       1     -        2          0       0.00001     - LET friendsOfFriends = ( /* subquery begin */
  9   TraversalNode         310     -   309777          0       7.38076       - FOR v  /* vertex (projections: `_id`) */, p  /* edge */ IN 2..2  /* min..maxPathDepth */ OUTBOUND 'users/354537' /* startnode */  GRAPH 'friends_only_recommendation' /* order: dfs */
 10   CalculationNode       310     -   309777          0       0.04701         - LET #16 = v.`_id`   /* attribute expression */
 11   CollectNode           197     -   196518          0       0.61441         - COLLECT friendOfFriend = #16 INTO paths = p   /* hash */
 23   SortNode              197     -   196518          0       1.59508         - SORT friendOfFriend ASC   /* sorting strategy: standard */
 12   CalculationNode       197     -   196518          0       0.07212         - LET #17 = { "friendOfFriend" : friendOfFriend, "pathCount" : LENGTH(paths) }   /* simple expression */
 28   SubqueryEndNode         1     -        1          0       0.03141         - RETURN  #17 ) /* subquery end */
 25   SubqueryStartNode       1     -        2          0       0.00204     - LET directFriends = ( /* subquery begin */
  4   TraversalNode           1     -      519          0       0.00743       - FOR v  /* vertex (projections: `_id`) */ IN 1..1  /* min..maxPathDepth */ OUTBOUND 'users/354537' /* startnode */  GRAPH 'friends_only_recommendation' /* order: dfs */
  5   CalculationNode         1     -      519          0       0.00009         - LET #14 = v.`_id`   /* attribute expression */
 26   SubqueryEndNode         1     -        1          0       0.00009         - RETURN  #14 ) /* subquery end */
 24   CalculationNode         1     0        1          0       0.00228     - LET #23 = SORTED_UNIQUE(directFriends)   /* simple expression */
 15   EnumerateListNode     197   193   196055        462       0.72862     - FOR friend IN friendsOfFriends   /* list iteration */   FILTER ((friend.`friendOfFriend` not in /* sorted */ #23) && (friend.`friendOfFriend` != "users/354537"))   /* early pruning */
 20   CalculationNode       197     2   196055          0       0.01556       - LET #20 = friend.`pathCount`   /* attribute expression */
 21   SortNode              197     0   196055          0       0.08242       - SORT #20 DESC   /* sorting strategy: standard */
 22   ReturnNode            197     -   196055          0       0.00021       - RETURN friend

Indexes used:
 By   Name   Type   Collection     Unique   Sparse   Cache   Selectivity   Fields        Stored values   Ranges
  9   edge   edge   friends_with   false    false    false       83.92 %   [ `_from` ]   [  ]            base OUTBOUND
  4   edge   edge   friends_with   false    false    false       83.92 %   [ `_from` ]   [  ]            base OUTBOUND

Traversals on graphs:
 Id  Depth  Vertex collections  Edge collections  Options                                              Filter / Prune Conditions
 9   2..2   users               friends_with      uniqueVertices: none, uniqueEdges: path, order: dfs  
 4   1..1   users               friends_with      uniqueVertices: none, uniqueEdges: path, order: dfs  

Optimization rules applied:
 Id   Rule Name                               Id   Rule Name                               Id   Rule Name                      
  1   move-calculations-up                     5   move-filters-up-2                        9   fuse-filters                   
  2   move-filters-up                          6   sort-in-values                          10   move-filters-into-enumerate    
  3   remove-unnecessary-calculations          7   optimize-traversals                     11   async-prefetch                 
  4   move-calculations-up-2                   8   move-calculations-down                  12   splice-subqueries              

Query Statistics:
 Writes Exec      Writes Ign      Doc. Lookups      Scan Full      Scan Index      Cache Hits/Misses      Filtered      Peak Mem [b]      Exec Time [s]
           0               0                 0              0          621106                520 / 0           462         103284736           10.58945

Query Profile:
 Query Stage                Duration [s]         Query Stage                Duration [s]         Query Stage                Duration [s]
 initializing                    0.00000         loading collections             0.00000         instantiating executors         0.00021
 parsing                         0.00035         instantiating plan              0.00014         executing                      10.58802
 optimizing ast                  0.00002         optimizing plan                 0.00084         finalizing                      0.00006

As we are talking graph traversal, the peak time (8s for query) seems too high to me. I am new to arangoDB and AQL (well, even graphDB all together), so there might be an obvious thing I am missing. I suspect I either missunderstood indexing or the query is wrong.

All input is appreciated!

Upvotes: 0

Views: 33

Answers (0)

Related Questions