Reputation: 193
I am researching Titan (on HBase) as a candidate for a large, distributed graph database. We require both OLTP access (fast, multi-hop queries over the graph) and OLAP access (loading all - or at least a large portion - of the graph into Spark for analytics).
From what I understand, I can use the Gremlin server to handle OLTP-style queries where my result-set will be small. Since my queries will be generated by a UI I can use an API to interface with the Gremlin server. So far, so good.
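For reference, this is a minimal sketch of what I have in mind for the OLTP side, assuming the standard TinkerPop gremlin-driver; the host, port and traversal are placeholders for whatever the UI generates:

```java
// Minimal OLTP sketch: submit a parameterised Gremlin query to a remote
// Gremlin Server via the TinkerPop driver. Host, port and the traversal
// itself are placeholders.
import java.util.Collections;
import java.util.List;

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.Result;

public class OltpQuery {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build("gremlin-server-host").port(8182).create();
        Client client = cluster.connect();
        try {
            // A small multi-hop traversal; the result set is expected to stay small.
            List<Result> results = client
                    .submit("g.V().has('name', n).out('knows').out('knows').limit(20)",
                            Collections.<String, Object>singletonMap("n", "alice"))
                    .all()
                    .get();
            results.forEach(r -> System.out.println(r.getObject()));
        } finally {
            client.close();
            cluster.close();
        }
    }
}
```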
The problem concerns the OLAP use case. Since the data in HBase will be co-located with the Spark executors, it would be efficient to read the data into Spark using an HDFSInputFormat. It would be inefficient (impossible, in fact, given the projected graph size) to execute a Gremlin query from the driver and then distribute the data back to the executors.
The best guidance I have found is an inconclusive discussion on the Titan GitHub repo (https://github.com/thinkaurelius/titan/issues/1045), which suggests that (at least for a Cassandra back-end) the standard TitanCassandraInputFormat should work for reading Titan tables. Nothing is claimed about HBase back-ends.
However, upon reading about the underlying Titan data model (http://s3.thinkaurelius.com/docs/titan/current/data-model.html), it appears that parts of the "raw" graph data are serialized, with no explanation of how to reconstruct a property graph from the contents.
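To be concrete about where I get stuck: pulling the raw HBase rows into Spark looks straightforward (a rough sketch, assuming Titan's default table name "titan" and HBase's standard TableInputFormat), but the values that come back are Titan's serialized rows:

```java
// Rough sketch of reading the raw Titan rows from HBase into Spark.
// Assumptions: Titan's default HBase table name "titan" and HBase's standard
// TableInputFormat; the Result values still contain Titan's serialized data.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RawTitanRows {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("raw-titan-rows"));

        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");  // placeholder quorum
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "titan");    // Titan's default table name

        // One record per HBase row: the key is a Titan vertex id, the Result holds
        // the serialized edges and properties for that vertex.
        JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
                hbaseConf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

        System.out.println("rows: " + rows.count());
        sc.stop();
    }
}
```

Decoding those Result values back into vertices, properties and edges is the part I cannot find documented.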
And so, I have two questions:
1) Is everything that I have stated above correct, or have I missed / misunderstood anything?
2) Has anyone managed to read a "raw" Titan graph from HBase and reconstruct it in Spark (either in GraphX or as DataFrames, RDDs etc)? If so, can you give me any pointers?
Upvotes: 4
Views: 689
Reputation: 1455
About a year ago, I encountered the same challenge as you describe -- we had a very large Titan instance, but we could not run any OLAP processes on it.
I researched the subject pretty deeply, but every solution I found (SparkGraphComputer, TitanHBaseInputFormat) was either very slow (a matter of days or weeks at our scale) or simply buggy and missing data. The main reason for the slowness was that all of them used HBase's main API, which turned out to be the speed bottleneck.
So I implemented Mizo - it is a Spark RDD for Titan on HBase that bypasses HBase's main API and parses HBase's internal data files (called HFiles).
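To make the difference concrete, here is a stripped-down sketch of the underlying idea (not Mizo's actual code): open a region's HFile directly with HBase's low-level reader and walk its cells, instead of issuing Scans against the region servers. The file path and directory layout are placeholders.

```java
// Stripped-down sketch of reading an HBase HFile directly, bypassing the Scan API.
// Not Mizo's actual code -- just the low-level read path it builds on.
// The HFile path is a placeholder; in practice you enumerate every store file
// of every region of the Titan table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class HFileReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path to one store file of the Titan table on HDFS.
        Path hfile = new Path("/hbase/data/default/titan/<region>/<family>/<storefile>");

        HFile.Reader reader = HFile.createReader(fs, hfile, new CacheConfig(conf), conf);
        HFileScanner scanner = reader.getScanner(false, false); // no block cache, no pread
        long cells = 0;
        if (scanner.seekTo()) {
            do {
                Cell cell = scanner.getKeyValue();
                // The row key is the Titan vertex id; the qualifier and value hold
                // the serialized relation (an edge or a property).
                byte[] vertexIdBytes = CellUtil.cloneRow(cell);
                cells++;
            } while (scanner.next());
        }
        reader.close();
        System.out.println("cells read: " + cells);
    }
}
```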
I have tested it on a pretty large scale -- a Titan graph with hundreds of billions of elements, weighing about 25TB.
Because it does not rely on the Scan API that HBase exposes, it is much faster. For example, counting edges in the graph I mentioned takes about 10 hours.
Upvotes: 1