What software should I use for distributed graph storage and processing?

Question

Problem in a nutshell:

There's a huge amount of input data in JSON format: about 1 TB right now, and it's going to grow. I was told that we're going to have a cluster.
I need to process this data, make a graph out of it and store it in a database. So every time I get a new JSON, I have to traverse the whole graph in a database to complete it.
Later I'm going to have a thin client in a browser, where I'm going to visualize some parts of the graph, search in it, traverse it, do some filtering, etc. So this system is not high load, just a lot of processing and data.
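
To make the processing step above more concrete, here is a minimal sketch of merging incoming JSON records into a graph. It assumes each record names one node and its neighbours; the field names `id` and `links` and the function `merge_record` are made up for illustration, and the in-memory dict is only a stand-in for whatever storage ends up being used.

```python
import json
from collections import defaultdict

# Adjacency sets keyed by node id; a stand-in for the real (distributed) store.
graph = defaultdict(set)

def merge_record(graph, raw_json):
    """Merge one JSON record into the graph as undirected edges."""
    record = json.loads(raw_json)
    node = record["id"]                        # hypothetical field name
    graph[node]                                # make sure isolated nodes exist too
    for neighbour in record.get("links", []):  # hypothetical field name
        graph[node].add(neighbour)
        graph[neighbour].add(node)
    return graph

merge_record(graph, '{"id": "a", "links": ["b", "c"]}')
merge_record(graph, '{"id": "b", "links": ["c"]}')
print(sorted(graph["c"]))
```

In this toy version each record only touches the nodes it mentions; whether the real data allows that, or really requires a wider traversal, depends on how records relate to each other.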

I have no experience with distributed systems, NoSQL databases, or other "big data" stuff. From my brief research I found that there are so many options that right now I'm just lost.

What I've got on my whiteboard at the moment:

1. Apache Spark's GraphX (GraphFrames) for distributed computing on top of some storage (HDFS, Cassandra, HBase, ...) and a resource manager (YARN, Mesos, Kubernetes, ...).
2. Some graph database. I think it's good to use a graph query language like Cypher in Neo4j or Gremlin in JanusGraph/TitanDB. Neo4j is good, but it has clustering only in the Enterprise Edition (EE) and I need something open source. So now I'm thinking about the latter ones, which use Gremlin + Cassandra + Elasticsearch by default.
3. Maybe I don't need any of these and can just store the graph as an adjacency matrix in some RDBMS like Postgres, and that's it.

I don't know if I need Spark with 2 or 3. Do I need it at all?
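
For the plain-RDBMS idea above, here is a minimal sketch of what I have in mind. It is only an illustration under some assumptions: it uses an edge table rather than a literal adjacency matrix (a matrix would be mostly empty for a sparse graph), it uses SQLite from the Python standard library as a stand-in for Postgres, and the table name `edges` and the helper `reachable` are made up. Postgres also supports `WITH RECURSIVE`, so the same kind of query should carry over.

```python
import sqlite3

# In-memory SQLite database as a stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges ("
    "  src TEXT NOT NULL,"
    "  dst TEXT NOT NULL,"
    "  PRIMARY KEY (src, dst))"
)
conn.executemany(
    "INSERT INTO edges (src, dst) VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")],
)

def reachable(conn, start):
    """Return all nodes reachable from `start` by following directed edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION              -- UNION (not UNION ALL) drops duplicates, so cycles terminate
            SELECT e.dst
            FROM edges AS e
            JOIN reach AS r ON e.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    return [row[0] for row in rows]

print(reachable(conn, "a"))
```

This works fine on a single machine for a small graph; what I can't judge is how well this kind of recursive traversal holds up at 1 TB and beyond, which is part of why I'm asking.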

My boss told me to check out Elasticsearch, but I guess I could only use it as an additional full-text search engine.

Thanks for any reply!

