Gleb Ignatev

Reputation: 105

What software should I use for distributed graph storage and processing?

Problem in a nutshell:

  1. There's a huge amount of input data in JSON format. Right now it's about 1 TB, but it's going to grow. I was told that we're going to have a cluster.
  2. I need to process this data, make a graph out of it and store it in a database. So every time I get a new JSON, I have to traverse the whole graph in a database to complete it.
  3. Later I'm going to have a thin client in a browser, where I'm going to visualize some parts of the graph, search in it, traverse it, do some filtering, etc. So this system is not high load, just a lot of processing and data.

I have no experience with distributed systems, NoSQL databases, or other "big data" technology. After a little research I found that there are too many options, and right now I'm just lost.

What I've got on my whiteboard at the moment:

  1. Apache Spark's GraphX (GraphFrames) for distributed computing on top of some storage (HDFS, Cassandra, HBase, ...) and a cluster manager (YARN, Mesos, Kubernetes, ...).
  2. Some graph database. I think it's good to use a graph query language like Cypher in Neo4j or Gremlin in JanusGraph/Titan. Neo4j is good, but it offers clustering only in the Enterprise Edition and I need something open source. So now I'm thinking about the latter ones, which use Gremlin + Cassandra + Elasticsearch by default.
  3. Maybe I don't need any of these, and can just store the graph as an adjacency matrix in some RDBMS like Postgres and that's it.
  4. I don't know whether I need Spark with options 2 or 3. Do I need it at all?
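
To make option 3 concrete, here is a minimal sketch of what the relational route could look like. It makes several assumptions: it uses an edge table (an adjacency list) instead of a literal adjacency matrix, since a matrix is awkward to store for a large sparse graph; it uses Python's built-in `sqlite3` as a stand-in for Postgres; and the table name `edge` and the helper names `build_graph` and `reachable` are made up for illustration.

```python
import sqlite3


def build_graph(edges):
    """Create an in-memory database holding one row per directed edge."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edge ("
        " src TEXT NOT NULL,"
        " dst TEXT NOT NULL,"
        " PRIMARY KEY (src, dst))"
    )
    # INSERT OR IGNORE is SQLite syntax; Postgres would use
    # INSERT ... ON CONFLICT DO NOTHING for the same effect.
    conn.executemany(
        "INSERT OR IGNORE INTO edge (src, dst) VALUES (?, ?)", edges
    )
    conn.commit()
    return conn


def reachable(conn, start):
    """Return every node reachable from `start`, including `start` itself."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION              -- UNION drops duplicates, so cycles terminate
            SELECT edge.dst
            FROM edge JOIN reach ON edge.src = reach.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_graph([("a", "b"), ("b", "c"), ("c", "a"), ("x", "y")])
print(reachable(conn, "a"))
```

A recursive query like this covers simple traversals, but each extra hop is another join, which is the kind of workload a graph query language is designed for.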

My boss told me to check out Elasticsearch, but I guess I can use it only as an additional full-text search engine.

Thanks for any reply!

Upvotes: 1

Views: 219

Answers (1)

Tom Geudens

Reputation: 2656

Let us start with a couple of follow-up questions:

  1. 1 TB is not a huge amount of data if that is also (close to) the total amount of data. Is it? How much new data are you expecting, and at what rate will it arrive?
  2. Why would you have to traverse the whole graph if each JSON merely refers to a small part of the graph? It's either new data or an update of existing data (which you should be able to pinpoint), isn't it?
  3. Yes, that's how you use a graph database ...
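
To illustrate point 2, here is a minimal sketch of merging one JSON document into a graph by key lookup, so that only the nodes it mentions are touched. The JSON shape (`id`, `props`, `links`) and the `Graph` class are assumptions made up for this example, not your actual data model; a graph database would do the same thing with an indexed lookup or an upsert.

```python
import json


class Graph:
    """Toy in-memory graph, keyed by node id (hypothetical data model)."""

    def __init__(self):
        self.nodes = {}     # node id -> dict of properties
        self.edges = set()  # (source id, target id) pairs

    def merge_document(self, raw):
        """Merge one JSON document; return the ids of the nodes it touched."""
        doc = json.loads(raw)
        node_id = doc["id"]

        # Look the node up by key: create it if new, update it if known.
        props = self.nodes.setdefault(node_id, {})
        props.update(doc.get("props", {}))

        touched = {node_id}
        for target in doc.get("links", []):
            # A linked node may not have been seen yet; add a placeholder.
            self.nodes.setdefault(target, {})
            self.edges.add((node_id, target))
            touched.add(target)
        return touched


g = Graph()
g.merge_document('{"id": "a", "props": {"name": "Alice"}, "links": ["b"]}')
touched = g.merge_document(
    '{"id": "b", "props": {"name": "Bob"}, "links": ["a", "c"]}'
)
print(sorted(touched))
```

The work per document depends on the size of that document, not on the size of the graph, which is why a full traversal should not be needed.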

The rest sort of depends on your answer to 1. If we're talking about IoT numbers of arriving events (tens of thousands per second, sustained), you might need a big data solution. If not, your main problem is getting the initial load done, and it should be smooth sailing from there ;-).

Hope this helps.

Regards, Tom

Upvotes: 1
