Reputation: 564

How do ArangoDB Graph Traversal Queries Execute in a Cluster?

In the description of SmartGraphs here it seems to imply that graph traversal queries actually follow edges from machine to machine until the query finishes executing. Is that how it actually works? For example, suppose that you have the following query that retrieves 1-hop, 2-hop, and 3-hop friends starting from the person with id 12345:

FOR p IN Person
  FILTER p._key == "12345"
  FOR friend IN 1..3 OUTBOUND p knows
    RETURN friend

Can someone please walk me through the lifetime of this query starting from the client and ending with the results on the client?

Upvotes: 1

Answers (1)

MrPieces

Reputation: 1

What actually happens can differ a bit from the diagrams on our website. What we show there is a kind of "worst case" where the data cannot be sharded perfectly (just to make it a bit more fun). But let's take a quick step back first and describe the different roles within an ArangoDB cluster. If you are already familiar with our cluster lingo and architecture, please skip the next paragraph.

You have the Coordinator, which, as the name says, coordinates the query execution and is also the place where the final result set is built before it is sent back to the client. Coordinators are stateless, host a query engine, and are the place where Foxx services live. The actual data is stored on the DBservers in a stateful fashion, but DBservers also have a distributed query engine, which plays a vital role in all our distributed query processing. The brain of the cluster is the Agency, with at least three agents running the Raft consensus protocol.

When you have sharded your graph data set as a SmartGraph, the following happens when a query is sent to a Coordinator:

- The Coordinator knows which data needed for the query resides on which machine and distributes the query accordingly to the respective DBservers.
- Each DBserver has its own query engine, processes the incoming query from the Coordinator locally, and then sends its intermediate result back to the Coordinator, where the final result set is put together. This runs in parallel across the DBservers.
- The Coordinator then sends the result back to the client.
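To make the steps above a bit more concrete, here is a toy Python sketch of the scatter/gather pattern. It is only an illustration, not ArangoDB code: the `SHARDS` dictionary, the server names, and the functions `run_on_dbserver` and `coordinator` are all hypothetical, and the sketch assumes the best case where everything reachable from the start vertex lives on one server.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: each "DBserver" stores one shard of the "knows" edges.
SHARDS = {
    "dbserver1": {"alice": ["bob"], "bob": ["carol"]},
    "dbserver2": {"dave": ["erin"]},
}

def run_on_dbserver(name, start, max_depth):
    """Follow only the locally stored edges and return the vertices reached."""
    edges = SHARDS[name]
    found, frontier = [], [start]
    for _ in range(max_depth):
        frontier = [n for v in frontier for n in edges.get(v, [])]
        found.extend(frontier)
    return found

def coordinator(start, max_depth):
    """Scatter the query to the relevant servers, then gather and merge."""
    # Assumption: the coordinator knows which servers hold the start vertex.
    targets = [name for name, edges in SHARDS.items() if start in edges]
    with ThreadPoolExecutor() as pool:
        partials = list(
            pool.map(lambda n: run_on_dbserver(n, start, max_depth), targets)
        )
    # Merge the intermediate results into the final result set.
    result = []
    for part in partials:
        for vertex in part:
            if vertex not in result:
                result.append(vertex)
    return result
```

Calling `coordinator("alice", 3)` in this sketch only ever touches `dbserver1`, which is the situation a well-sharded SmartGraph aims for.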

If you have a perfectly shardable graph (e.g. a hierarchy whose branches are the shards; use cases could be a Bill of Materials or network analytics), then you can achieve performance close to that of a single instance, because queries can be sent to the right DBservers and no network hops are required. If you have a much more "unstructured" graph, like a social network where connections can occur between any two given vertices, sharding becomes an optimization question and, depending on the query, it is more likely that network hops between servers occur. This latter case is what the diagrams on our website show. In this case, the SmartGraph feature can reduce the number of network hops needed, but it cannot eliminate them completely.
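As a rough illustration of why the placement of vertices matters, the following Python sketch counts how many traversed edges cross from one server to another. It is a simplified model under my own assumptions, not how ArangoDB measures anything: `count_network_hops`, the `tree` graph, and the two placement maps are made up for the example.

```python
from collections import deque

def count_network_hops(edges, placement, start, max_depth):
    """Breadth-first traversal that counts edges whose endpoints
    are placed on different servers."""
    hops = 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, depth = queue.popleft()
        if depth == max_depth:
            continue
        for neighbor in edges.get(vertex, []):
            if placement[neighbor] != placement[vertex]:
                hops += 1  # following this edge would need a network hop
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return hops

# Hypothetical hierarchy with two branches, rooted at "a" and "b".
tree = {"a": ["a1", "a2"], "a1": ["a3"], "b": ["b1"]}

# Placement 1: each branch lives entirely on one server.
by_branch = {"a": "s1", "a1": "s1", "a2": "s1", "a3": "s1", "b": "s2", "b1": "s2"}

# Placement 2: the same vertices spread without regard to the branches.
scattered = {"a": "s1", "a1": "s2", "a2": "s1", "a3": "s1", "b": "s2", "b1": "s1"}

print(count_network_hops(tree, by_branch, "a", 3))  # 0
print(count_network_hops(tree, scattered, "a", 3))  # 2
```

With the branch-aligned placement the traversal stays on one server, while the scattered placement forces hops for the same query.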

Hope this helped a bit.

Upvotes: 0
