data model for tree structure (file system): document model vs graph model

I'm evaluating a nosql solution for implement a file system like structure, with millions of items, where key features have to be:

speed finding "parents" or "direct childs" or "subtree childs" of an item filtered by n item properties, with page results sorted by item property.

having this requirements i split the problem in 2 task:

model the recursive items structure for search childs/subtree childs
model the item structure for search over items property

Now the power of nosql schema free is a good feature for store different properties for each file, and this is good for point 2.

I have instead some doubt over point 1 about the pros / cons to use a document database (example mongodb) with a single collection of items and a materialized path design pattern, or using a graph database (example arangodb) with 2 collection: items for data (document collection), and itemsParents for parent-child relation (edge collection) and a graph traverse function.

There are advantages in performance using a graph database for my requirements?

graph traverse is more efficient over materialized path filter to accomplish my task?

If yes, can you explain me why?

Thanks

Upvotes: 7

Answers (2)

Max Neunhöffer

Reputation: 1392

Your use case is better served by a multi-model database, which is both a document store and a graph database. With such a data store you can put all your items as vertices in one collection and have the relations for the hierarchy as edges in a separate collection. Additionally, you could store the path with every item and have a sorted index and, if constant time matters, a hash index on the path attribute.

You would get

O(1) (constant time) lookup for an item by its path (using the hash index)
O(1) lookup for a parent either by a graph neighbour or by a lookup by (truncated) path
finding all n direct children in time O(n) using graph neighbours
finding a complete subtree by a range lookup in the sorted index in time proportional to the number of items in the result
get near arbitrary fast item accesses by adding further secondary indexes

1.-4. is best possible, because the complexity cannot be better than the size of the result set.

Let me explain the performance arguments in a bit more detail:

Cases 1. and 2. are good with both approaches, since you only need a single result, which you can access directly. Whether or not you use a hash index or let a sorted index suffice will probably matter very little.

Case 3. (direct children) is faster with a graph database, since it has "find all direct neighbours" as a very fast primitive. If you rely on the materialized path and a sorted index, you cannot get the direct children with a range query and so it will be slower.

Case 4. (whole subtree) is faster with a materialized path and a range query using a sorted index, since that can just emit (using an iterator if you want) a complete range. A graph database will have to perform a graph traversal which will involve (temporary) data mutations on the server for the enumeration.

The advantage with a multi-model database is that you can combine the advantages of the two approaches (3. and 4.) you suggest in a single data store.

One possibility for such a database is the ArangoDB database.

Disclaimer: I am one of the main developers.

Upvotes: 11

Brian Underwood

Reputation: 10856

A graph database would certainly be a great choice for a hierarchical structure like a filesystem. Speaking specifically of Neo4j you could have a schema such as:

(:Folder)-[:IN_FOLDER]->(:Folder)
(:File)-[:IN_FOLDER]->(:Folder)

Finding a file or a folder is as simple as the following Cypher:

MATCH (file:File {path: '/dir/file'})
RETURN file

To find all of the files/folders directly under a folder:

MATCH (folder:Folder {path: '/dir'})<-[:IN_FOLDER]-(file_or_folder)
RETURN file_or_folder

If you wanted to find all files/folders recursively you could do:

MATCH (folder:Folder {path: '/dir'})<-[:IN_FOLDER*1..5]-(file_or_folder)
RETURN file_or_folder

The 1..5 adjusts the depth (from one to five levels) which you are searching.

For all of these you'd want an index on the path property for both the Folder and File labels. Of course you wouldn't need to do it this way, depending on your use-case.

The reason that Neo4j can be so much faster in this case is because once you find a node on disk the relationships can be traversed with just a few file accesses as opposed to searching an entire table or index for each hop. I recommend checking out the free book Graph Databases by O'Reilly for details on the internals of Neo4j.

Upvotes: 11

data model for tree structure (file system): document model vs graph model

Answers (2)

Related Questions