Roman

Reputation: 267

Neo4j or GraphX / Giraph: what to choose?

Just started my excursion into graph processing methods and tools. What we basically do is compute some standard metrics like PageRank, clustering coefficient, triangle count, diameter, connectivity, etc. In the past we were happy with Octave, but when we started to work with graphs having, let's say, 10^9 nodes/edges, we got stuck.

So the possible solutions could be a distributed cluster built with Hadoop/Giraph, Spark/GraphX, Neo4j on top of them, etc.

But since I am a beginner, can someone advise what to actually choose? I did not get the difference between when to use Spark/GraphX and when to use Neo4j. Right now I am considering Spark/GraphX, since it has a more Python-like syntax, while Neo4j has its own Cypher. Visualization in Neo4j is cool but not useful at such a large scale. Is there a reason to use an additional layer of software (Neo4j), or should I just use Spark/GraphX? As I understand it, Neo4j will not save as much time as working with pure Hadoop did vs. Giraph or GraphX or Hive.

Thank you.

Upvotes: 23

Views: 27706

Answers (5)

GeaFlow

Reputation: 1

Neo4j is a graph database for graph queries on a single machine.

Spark GraphX is a distributed graph computing system for large-scale data. It is mainly designed for offline computing scenarios.

For streaming graph computing over large-scale data, you can use GeaFlow, a distributed streaming graph computing system that can run graph computations at trillion-vertex scale in seconds.

Upvotes: 0

Lei Zhang

Reputation: 1

Perhaps GRAPE-JDK is the graph computing framework you need.

GRAPE-JDK is a high-performance and user-friendly Java SDK for graph computing provided by the GraphScope community. It not only supports the native PIE computing model, but can also seamlessly migrate Giraph and GraphX algorithms to run on GraphScope. If you use GRAPE-JDK to write graph algorithms, your program can even achieve performance similar to C++ implementations. Want to learn more? See this link: Breaking the Language Barrier in Large Scale Graph Computing.

If you are interested in the underlying principle: GRAPE-JDK encapsulates the C++ API using FastFFI, and the JNI glue code can be generated automatically. Meanwhile, LLVM4JNI can convert LLVM-compiled bitcode into Java bytecode, thereby gaining the benefit of the JVM's JIT.

Disclaimer: I'm an author of GraphScope.

Upvotes: 0

Praveen Kumar K S

Reputation: 3074

Neo4j: It is a graph database which helps identify relationships and entities in data, usually from disk. Its popularity and the reasons to choose it are given in this link. But when it needs to process very large data-sets with real-time processing to produce graphical results/representations, it needs to scale horizontally. In this case, the combination of Neo4j with Apache Spark gives significant performance benefits, with Spark serving as an external graph compute solution.

Mazerunner is a distributed graph processing platform which extends Neo4j. It uses a message broker to distribute graph processing jobs to Apache Spark's GraphX module.


GraphX: GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. It supports multiple graph algorithms.
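
To make the abstraction concrete, here is a minimal Scala sketch that builds such a property graph from vertex and edge RDDs (the vertex names, edge weights, and local SparkContext are made up for the example):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph, VertexId}
    import org.apache.spark.rdd.RDD

    object PropertyGraphDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("property-graph").setMaster("local[*]"))

        // Vertices: (id, property) pairs; here the property is just a name.
        val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
          (1L, "alice"), (2L, "bob"), (3L, "carol")))

        // Edges: directed, each with its own property (here, a weight).
        val edges: RDD[Edge[Int]] = sc.parallelize(Seq(
          Edge(1L, 2L, 7), Edge(2L, 3L, 2), Edge(3L, 1L, 4)))

        // The directed multigraph with properties on vertices and edges.
        val graph: Graph[String, Int] = Graph(vertices, edges)
        println(s"vertices=${graph.numVertices}, edges=${graph.numEdges}")
        sc.stop()
      }
    }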

Conclusion: It is generally recommended to use the hybrid combination of Neo4j with GraphX, as the two are easy to integrate.

For real-time processing and for processing large data-sets, use Neo4j with GraphX.
For simple persistence and to show entity relationships in a simple graphical representation, use standalone Neo4j.

Upvotes: 24

Xinh Huynh

Reputation: 151

Neo4j: I have not used it, but I think it does all of a graph computation (like PageRank) on a single machine. Would that be able to handle your data set? It may depend on whether your entire graph fits into memory, and if not, how efficiently it processes data from disk. It may hit the same problems you encountered with Octave.

Spark GraphX: GraphX partitions graph data (vertices and edges) across a cluster of machines. This gives you horizontal scalability and parallelism in computation. Some things you may want to consider: it only has a Scala API right now (no Python yet). It does PageRank, triangle count, and connected components, but you may have to implement clustering coefficient and diameter yourself, using the provided graph API (Pregel, for example). The programming guide has a list of supported algorithms: https://spark.apache.org/docs/latest/graphx-programming-guide.html
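
For a sense of what the built-in algorithms mentioned above look like in practice, here is a rough sketch assuming a local Spark setup and a hypothetical edge-list file:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    object GraphMetrics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("graph-metrics").setMaster("local[*]"))

        // GraphLoader expects an edge-list file with one "srcId dstId" pair
        // per line; "edges.txt" is a hypothetical path.
        val graph = GraphLoader.edgeListFile(sc, "edges.txt")
          .partitionBy(PartitionStrategy.RandomVertexCut) // required by triangleCount on older Spark versions

        val ranks      = graph.pageRank(0.0001).vertices        // PageRank (tolerance-based)
        val triangles  = graph.triangleCount().vertices         // triangles through each vertex
        val components = graph.connectedComponents().vertices   // component id per vertex

        ranks.take(5).foreach(println)
        sc.stop()
      }
    }

Clustering coefficient can then be derived from each vertex's triangle count and degree, while diameter would indeed require a custom Pregel program.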

Upvotes: 8

tkroman

Reputation: 4798

GraphX is more of a real-time processing framework for data that can be (and is better when) represented in graph form. With GraphX you can run various algorithms that require large amounts of processing power (both RAM and CPU), and with Neo4j you can (reliably) persist and update that data. This is what I'd suggest.

I know for sure that @kennybastani has made some pretty interesting advancements in that area; you can take a look at his Mazerunner solution. It's also shipped as a Docker image, so you can poke at it with a stick and find out for yourself whether you like it or not. From the project description:

This image deploys a container with Apache Spark and uses GraphX to perform ETL graph analysis on subgraphs exported from Neo4j. The results of the analysis are applied back to the data in the Neo4j database.
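
Mazerunner drives this through a message broker, but the underlying export -> analyze -> write-back loop can be sketched directly. The following is only an illustrative sketch, assuming the Neo4j Java (Bolt) driver and a local Spark setup; the connection details, the :KNOWS relationship type, and the pagerank property name are all made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}
    import org.neo4j.driver.{AuthTokens, GraphDatabase}
    import scala.jdk.CollectionConverters._

    object Neo4jGraphXRoundTrip {
      def main(args: Array[String]): Unit = {
        // Hypothetical connection details.
        val driver  = GraphDatabase.driver("bolt://localhost:7687",
          AuthTokens.basic("neo4j", "password"))
        val session = driver.session()

        // 1. Export an edge list (a subgraph) from Neo4j.
        val edges = session
          .run("MATCH (a)-[:KNOWS]->(b) RETURN id(a) AS src, id(b) AS dst")
          .list().asScala
          .map(r => Edge(r.get("src").asLong, r.get("dst").asLong, 1))

        // 2. Analyze the subgraph with GraphX.
        val sc = new SparkContext(
          new SparkConf().setAppName("roundtrip").setMaster("local[*]"))
        val graph = Graph.fromEdges(sc.parallelize(edges.toSeq), defaultValue = 1)
        val ranks = graph.pageRank(0.0001).vertices.collect()

        // 3. Apply the results back to Neo4j as node properties.
        ranks.foreach { case (id, rank) =>
          session.run(s"MATCH (n) WHERE id(n) = $id SET n.pagerank = $rank")
        }

        sc.stop(); session.close(); driver.close()
      }
    }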

Upvotes: 7
