Reputation: 683
I have been testing out Titan-Cassandra and OrientDB lately and a question came to mind.
How do graph databases shard a graph across the machines of a cluster, and how do their query interfaces support queries on a sharded graph, e.g. finding the shortest path between two nodes?
I know that Gremlin implements the MapReduce pattern for its groupBy function.
But I want to understand in more depth how querying and sharding relate, and how these two databases handle queries on sharded graphs. In particular, I'm interested in how OrientDB's SQL interface supports querying across sharded graphs.
I know Neo4j argues against sharding, based on the answers to a previous question I asked.
Upvotes: 2
Views: 1477
Reputation: 1702
Please see the following two posts about Titan (http://titan.thinkaurelius.com):
Typically, when you begin developing a graph application, you are using a single machine. In this model, the entire graph is on one machine. If the graph is small (in data size) and the transactional load is low (not a massive number of reads and writes), then when you go into production, you simply add replication for high availability. With non-distributed replication, the data is fully copied to the other machines, and if any one machine goes down, the others are still available to serve requests. Again, note that in this situation your data is not partitioned/distributed, just replicated.
Next, as your graph grows in size (beyond the memory and HD space of a single machine), you need to start thinking about distribution. With distribution, you partition your graph over a multi-machine cluster and (to ensure high availability) make sure you have some data redundancy (e.g. replication factor 3).
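To make partitioning with a replication factor concrete, here is a minimal sketch. It is not how Titan or Cassandra actually place data; the function `replicas_for`, the node names, and the hash-then-walk-the-ring scheme are all assumptions chosen for illustration.

```python
import hashlib

def replicas_for(vertex_id, nodes, replication_factor=3):
    """Return the nodes holding a copy of vertex_id (hypothetical scheme)."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds cluster size")
    # Hash the id so that vertices spread evenly over the cluster.
    digest = hashlib.md5(str(vertex_id).encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(nodes)
    # Primary owner plus the next nodes on the ring hold the replicas.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]

# Hypothetical five-machine cluster.
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = {v: replicas_for(v, nodes) for v in range(10)}
```

With five nodes and a replication factor of 3, every vertex lives on three distinct machines, so any single machine can fail without making a vertex unavailable.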
There are two ways to partition data in Titan currently:
At the end of the day, the whole story is about co-location. Can you ensure that co-retrieved data is close in physical space?
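One rough way to quantify co-location is the fraction of edges whose two endpoints sit on different partitions: the lower it is, the fewer traversal steps have to cross the network. The sketch below is only an illustration; `cut_fraction`, the toy graph, and both vertex-to-partition assignments are made up for the example and are not part of Titan.

```python
def cut_fraction(edges, partition_of):
    """Fraction of edges whose endpoints live on different partitions."""
    if not edges:
        return 0.0
    crossing = sum(1 for u, v in edges if partition_of[u] != partition_of[v])
    return crossing / len(edges)

# Toy graph: two triangles (a-b-c and d-e-f) joined by the edge c-d.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("d", "e"), ("e", "f"), ("f", "d"),
         ("c", "d")]

# Assignment that keeps each triangle on one partition.
clustered = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
# Assignment that alternates vertices between partitions.
scattered = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

print(cut_fraction(edges, clustered))  # 1 of 7 edges crosses partitions
print(cut_fraction(edges, scattered))  # 5 of 7 edges cross partitions
```

A shortest-path query that stays inside one triangle touches a single machine under the clustered assignment, but hops between machines on most steps under the scattered one.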
Finally, note that Titan allows for parallel reads (and writes) using Faunus (http://faunus.thinkaurelius.com). Thus, if you have an OLAP question that requires scanning the entire graph, Titan's co-location model is handy because a vertex and its edges can be fetched in one sequential read from disk. Again, the story remains the same: co-location in space in accordance with co-retrieval in time.
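The idea that a vertex and its edges form one sequential read can be sketched with a sorted in-memory row store. This is a simplified illustration, not Titan's storage format: the class `SortedEdgeStore`, its methods, and the assumption of integer vertex ids are all hypothetical.

```python
import bisect

class SortedEdgeStore:
    """Keeps edge rows sorted by (vertex_id, label, neighbor_id)."""

    def __init__(self):
        self._rows = []  # always kept in sorted order

    def add_edge(self, vertex_id, label, neighbor_id):
        # Insert at the sorted position so a vertex's edges stay adjacent.
        bisect.insort(self._rows, (vertex_id, label, neighbor_id))

    def edges_of(self, vertex_id):
        # One contiguous slice: all rows whose first field is vertex_id.
        # Assumes integer ids, so (vertex_id + 1,) is the next key prefix.
        lo = bisect.bisect_left(self._rows, (vertex_id,))
        hi = bisect.bisect_left(self._rows, (vertex_id + 1,))
        return self._rows[lo:hi]

store = SortedEdgeStore()
store.add_edge(2, "knows", 5)
store.add_edge(1, "knows", 3)
store.add_edge(3, "likes", 1)
store.add_edge(1, "knows", 2)
print(store.edges_of(1))  # the two "knows" rows for vertex 1, stored side by side
```

Because the rows are ordered by vertex id, reading one vertex's neighborhood is a single range scan, and a full-graph scan visits each vertex's edges as one contiguous block.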
Upvotes: 8