Reputation: 51
I'm trying to figure out what kind of graph algorithm / what type of graph structure neo4j use.
It has directed and labelled edges. Also can store properties in nodes and edges.
Upvotes: 1
Views: 215
Reputation: 30407
Neo4j is a labeled property graph.
For physical structure, we have stores for things: a node store, a relationship store, a property store, a label store, a strings store, etc.
These are just files of multiple entries in fixed formats, so every entry in the graph store has the same formatting, every entry in the relationship store has the same formatting, etc, and there are pointers (graph ids) between the entries. So graph ids are actually the offsets used for pointer hopping between linked and relevant elements in the stores.
Logically, a node entry in the node store has various pointers to other entries in other stores, such as its labels, and the first relationship in the relationship chain. Relationships have pointers to the start and end nodes of the relationship and the previous and next relationship in the chain. And both nodes and relationships have pointers to the first property entry that make up the start of their property chains.
So logically we have the equivalent of linked list structures connecting some of our elements (property chains per node, relationship chains per node). But for the connection between nodes via relationships, what you draw on the whiteboard when modeling a graph is a pretty good approximation of what's going on in the structure, nodes and relationships link directly to each other, pointing at each other's entry in the relevant store files.
We do have a bit of redundancy to speed up lookups...nodes hold some amount of labels in the node entry itself, and relationships hold their type.
This is also the simplest abstraction, we do have a slightly different format we use on dense nodes such that we bucket relationships on a node by their type and direction, allowing fast selection of relevant relationships by what we're looking for rather than having to consider and filter all relationships present.
So that's a high level overview of the physical structure. What's more interesting though is why this kind of structure was chosen, what it allows, and how that factors into why native graph databases are competitive vs relational databases for graphy and traversal-heavy use cases.
The point of all this is that relationship traversal doesn't use table joins or their equivalent. The cost to traverse relationships/patterns isn't based on the total relationships or nodes in the graph, but only on the actual relationships and nodes present, O(1) pointer hopping between the stores, node -> rel -> node etc.
This is a characteristic of native graph databases known as index-free adjacency, and is enabled by the store formats we use. This is one of the primary reasons, for graphy use cases and connected data, you would consider a native graph database over a relational database (or over a non-native graph database or overlay that just sits on top of a relational database).
So as an example, if we are at node 1 with 3 relationships to other nodes, and there are 10 billion nodes in the graph and 50 billion relationships, the traversal to the next node in the pattern will be very fast. It doesn't matter how many relationships or nodes there are total, it only matters that from the current node, there are only 3 relationships present, so we do three O(1) pointer hops (and any filtering) and continue traversal from those three nodes as try to find paths that fit the pattern. The cost of traversal is proportional to whatever parts of the graph you end up touching/traversing, not the total size of the graph.
Of course to create the relationships, you often have to do the equivalent of table joins...MATCHing on nodes by certain properties (usually using an index structure used to find entry points in the graph), then creating the relationship between them.
So yes index lookups do come into play, but these happen during relationship creation. Once the relationships are created, then querying over them becomes O(1) per traversal, pointer hopping, plus the cost of whatever additional filtering you want to do. We offset the join cost to relationship creation so that runtime traversal speed is optimized.
Remaining index usage is for finding entry points in the graph, and once we have those, traversal takes the spotlight and once again we're just pointer hopping, no datasize-dependent table joins.
An older blog entry of one of my colleagues, Max De Marzi, includes a description and visualization of the store format that may help. It is a little outdated in some respects (both the blog entry and video linked from it are from 2012), and we're always looking for ways to optimize and adjust our store formatting to improve performance, but the spirit of that entry is what matters:
Relationships are kind of like a manifested table join, but we pay the cost on relationship creation, not on traversal. For traversal, we're just pointer hopping, and that doesn't slow down as your data grows, the way table joins do in a relational database.
Upvotes: 2