Reputation: 1
I have been working on pretty large dataset of friends_with graph (3.5m users, 234m friend_with edges). User collection consist of: _key,_id,name
friend_with edge collection: _key,_id,_from,_to (collection is denormalized, eg. 1->2 means also 2->1)
I'm running arangoDB community on VM (32vcores, 32GB RAM, 400GB ssd).
I need to effectively return list of friends of friends, excluding immediate friends, ordered by most common users in descending order. (In future, I'd like to also include groups and schools into the graph, and query which can add around 50m more edges and more complexity to the query).
So far I have come up with following query, which can return results almost instantly in low complexity cases, but gets into problems on complex branching:
LET userId = 'users/354537'
LET directFriends = (
FOR v IN 1..1 outbound userId graph friends_only_recommendation
RETURN v._id
)
LET friendsOfFriends = (
FOR v, p IN 2..2 outbound userId graph friends_only_recommendation
COLLECT friendOfFriend = v._id INTO paths = p
RETURN {
friendOfFriend: friendOfFriend,
pathCount: LENGTH(paths)
}
)
FOR friend in friendsOfFriends
FILTER friend.friendOfFriend NOT IN directFriends
FILTER friend.friendOfFriend != userId
SORT friend.pathCount DESC
RETURN friend
Results over 100 random users (pure query time in seconds):
total friends only: 18.279518464463763
avg friends only: 0.18279518464463762
max friends: 8.115287372027524
min friends only: 5.98470214754343e-05
Profiling result of the peak user:
Execution plan:
Id NodeType Calls Par Items Filtered Runtime [s] Comment
1 SingletonNode 1 - 1 0 0.00000 * ROOT
27 SubqueryStartNode 1 - 2 0 0.00001 - LET friendsOfFriends = ( /* subquery begin */
9 TraversalNode 310 - 309777 0 7.38076 - FOR v /* vertex (projections: `_id`) */, p /* edge */ IN 2..2 /* min..maxPathDepth */ OUTBOUND 'users/354537' /* startnode */ GRAPH 'friends_only_recommendation' /* order: dfs */
10 CalculationNode 310 - 309777 0 0.04701 - LET #16 = v.`_id` /* attribute expression */
11 CollectNode 197 - 196518 0 0.61441 - COLLECT friendOfFriend = #16 INTO paths = p /* hash */
23 SortNode 197 - 196518 0 1.59508 - SORT friendOfFriend ASC /* sorting strategy: standard */
12 CalculationNode 197 - 196518 0 0.07212 - LET #17 = { "friendOfFriend" : friendOfFriend, "pathCount" : LENGTH(paths) } /* simple expression */
28 SubqueryEndNode 1 - 1 0 0.03141 - RETURN #17 ) /* subquery end */
25 SubqueryStartNode 1 - 2 0 0.00204 - LET directFriends = ( /* subquery begin */
4 TraversalNode 1 - 519 0 0.00743 - FOR v /* vertex (projections: `_id`) */ IN 1..1 /* min..maxPathDepth */ OUTBOUND 'users/354537' /* startnode */ GRAPH 'friends_only_recommendation' /* order: dfs */
5 CalculationNode 1 - 519 0 0.00009 - LET #14 = v.`_id` /* attribute expression */
26 SubqueryEndNode 1 - 1 0 0.00009 - RETURN #14 ) /* subquery end */
24 CalculationNode 1 0 1 0 0.00228 - LET #23 = SORTED_UNIQUE(directFriends) /* simple expression */
15 EnumerateListNode 197 193 196055 462 0.72862 - FOR friend IN friendsOfFriends /* list iteration */ FILTER ((friend.`friendOfFriend` not in /* sorted */ #23) && (friend.`friendOfFriend` != "users/354537")) /* early pruning */
20 CalculationNode 197 2 196055 0 0.01556 - LET #20 = friend.`pathCount` /* attribute expression */
21 SortNode 197 0 196055 0 0.08242 - SORT #20 DESC /* sorting strategy: standard */
22 ReturnNode 197 - 196055 0 0.00021 - RETURN friend
Indexes used:
By Name Type Collection Unique Sparse Cache Selectivity Fields Stored values Ranges
9 edge edge friends_with false false false 83.92 % [ `_from` ] [ ] base OUTBOUND
4 edge edge friends_with false false false 83.92 % [ `_from` ] [ ] base OUTBOUND
Traversals on graphs:
Id Depth Vertex collections Edge collections Options Filter / Prune Conditions
9 2..2 users friends_with uniqueVertices: none, uniqueEdges: path, order: dfs
4 1..1 users friends_with uniqueVertices: none, uniqueEdges: path, order: dfs
Optimization rules applied:
Id Rule Name Id Rule Name Id Rule Name
1 move-calculations-up 5 move-filters-up-2 9 fuse-filters
2 move-filters-up 6 sort-in-values 10 move-filters-into-enumerate
3 remove-unnecessary-calculations 7 optimize-traversals 11 async-prefetch
4 move-calculations-up-2 8 move-calculations-down 12 splice-subqueries
Query Statistics:
Writes Exec Writes Ign Doc. Lookups Scan Full Scan Index Cache Hits/Misses Filtered Peak Mem [b] Exec Time [s]
0 0 0 0 621106 520 / 0 462 103284736 10.58945
Query Profile:
Query Stage Duration [s] Query Stage Duration [s] Query Stage Duration [s]
initializing 0.00000 loading collections 0.00000 instantiating executors 0.00021
parsing 0.00035 instantiating plan 0.00014 executing 10.58802
optimizing ast 0.00002 optimizing plan 0.00084 finalizing 0.00006
As we are talking graph traversal, the peak time (8s for query) seems too high to me. I am new to arangoDB and AQL (well, even graphDB all together), so there might be an obvious thing I am missing. I suspect I either missunderstood indexing or the query is wrong.
All input is appreciated!
Upvotes: 0
Views: 33