Reputation: 41
The project I am working on currently uses Neo4j Community. We currently process 1-5M vertices with 5-20M edges, but we aim to handle 10-20M vertices with 50-100M edges. We are discussing switching to an open source graph database that would let us scale to these proportions. Our mind is currently set on JanusGraph with Cassandra.
We have some questions regarding the capabilities and development of JanusGraph, and we would be glad if someone could answer! (Maybe Misha Brukman or Aaron Ploetz?)
On JanusGraph capabilities:
We ran some experiments using the ready-to-use JanusGraph Docker image, with queries issued from a Java program. The Java program and the Docker image run on the same machine. With 10k-20k vertices and 50k-100k edges inserted, a query fetching all the vertices that possess a given property takes 8 to 10 seconds (mean over 10 identical queries, measured as the time elapsed between just before and just after the command in the Java program). The command itself is really simple:
g.V().has("secText", "some text").inE().outV();
Moreover, the Docker image seems to break down when I try to insert more records (going towards 100k vertices).
We wonder whether this is due to the limited nature of the Docker image, whether there is a problem somewhere, or whether this is normal. Either way, it seems really, really slow.
We also set up a two-node Cassandra cluster (on two different VMs) with JanusGraph on top; again, the results were quite slow.
From what I read on the Internet, people seem to run JanusGraph deployments with millions of vertices in production, so I guess they can execute simple queries in a matter of milliseconds. What is the secret there? Do you need something like 128GB of RAM for the whole thing to perform correctly? Or maybe there is a guide of good practices that I am unaware of? I tried my best using the official JanusGraph documentation and user comments on forums like this one, but that isn't much, I'm afraid :/
On JanusGraph's future:
Thank you for reading all this and I am looking forward to all the answers you can give me :) have a nice day!
Mael
Upvotes: 4
Views: 747
Reputation: 181
Hello, I know this might be late, but please tell me: are you accessing all the vertices for analytical or transactional queries (OLAP or OLTP)? How many vertices and edges you query, and how you query them, has a major effect. For example, do you ask JanusGraph to return a vertex that has millions of edges together with all of those edges in one shot, or only a few of them? This is referred to as the hot vertex problem (a hot vertex being one that has so many edges that they cannot be stored on one server instance).
Upvotes: 0
Reputation: 1381
JanusGraph with Cassandra has design limitations at the storage layer that make performance slow. In practice, it's a large, scalable, but slow graph database that offers the replication and redundancy benefits of Cassandra.
Cassandra shards data and is very good at distributing it randomly across the cluster; however, this destroys the data locality needed to make traversals fast and efficient. JanusGraph also supports several back-end storage options in addition to Cassandra, which means it's not tightly tuned to any particular storage architecture.
Memory can make a difference, so verify how much memory you have allocated to the JVM on each node, use G1GC, and disable swap. VisualVM is helpful for profiling your memory headroom.
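As a quick way to verify the memory and collector settings, here is a minimal sketch in plain Java (standard library only). The class name `JvmMemoryCheck` and its helper methods are hypothetical names chosen for illustration. It only reports on the JVM it runs in, so it is meaningful only if you launch it with the same JVM options as the process you care about; it does not inspect a running JanusGraph or Cassandra instance.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports the max heap and the garbage collectors
// of the JVM that executes it.
public class JvmMemoryCheck {

    // Maximum heap the JVM will attempt to use, in MiB (driven by -Xmx).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Names of the active garbage collectors, as exposed by the JVM.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    // True if any collector name starts with "G1", which is how HotSpot
    // names the G1 collectors (assumption: a HotSpot-based JVM).
    static boolean usesG1() {
        for (String name : collectorNames()) {
            if (name.startsWith("G1")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        System.out.println("Collectors:     " + collectorNames());
        System.out.println("G1 in use:      " + usesG1());
    }
}
```

Running it with, for example, `java -Xmx4g -XX:+UseG1GC JvmMemoryCheck` (the 4g heap size is an arbitrary example, not a recommendation) should show whether the options you think you set are actually taking effect.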
Upvotes: 1