Reputation: 1089
I am trying to benchmark three different graph databases: Titan, OrientDB, and Neo4j. I want to measure the execution time for creating each database. As a test case I use this dataset: http://snap.stanford.edu/data/web-flickr.html . Although the data is stored locally on disk and not in the computer's memory, I have noticed that a lot of memory is consumed and, unfortunately, after a while Eclipse crashes. Why is this happening?
Here are some code snippets. Titan graph creation:
public long createGraphDB(String datasetRoot, TitanGraph titanGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while ((line = reader.readLine()) != null) {
            if (lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = titanGraph.addVertex(null);
                srcVertex.setProperty("nodeId", parts[0]);
                Vertex dstVertex = titanGraph.addVertex(null);
                dstVertex.setProperty("nodeId", parts[1]);
                Edge edge = titanGraph.addEdge(null, srcVertex, dstVertex, "similar");
                titanGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch (IOException ioe) {
        ioe.printStackTrace();
    }
    catch (Exception e) {
        titanGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}
OrientDB graph creation:
public long createGraphDB(String datasetRoot, OrientGraph orientGraph) {
    long duration;
    long startTime = System.nanoTime();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while ((line = reader.readLine()) != null) {
            if (lineCounter > 4) {
                String[] parts = line.split(" ");
                Vertex srcVertex = orientGraph.addVertex(null);
                srcVertex.setProperty("nodeId", parts[0]);
                Vertex dstVertex = orientGraph.addVertex(null);
                dstVertex.setProperty("nodeId", parts[1]);
                Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");
                orientGraph.commit();
            }
            lineCounter++;
        }
        reader.close();
    }
    catch (IOException ioe) {
        ioe.printStackTrace();
    }
    catch (Exception e) {
        orientGraph.rollback();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}
Neo4j graph creation:
public long createDB(String datasetRoot, GraphDatabaseService neo4jGraph) {
    long duration;
    long startTime = System.nanoTime();
    Transaction tx = neo4jGraph.beginTx();
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(datasetRoot)));
        String line;
        int lineCounter = 1;
        while ((line = reader.readLine()) != null) {
            if (lineCounter > 4) {
                String[] parts = line.split(" ");
                Node srcNode = neo4jGraph.createNode();
                srcNode.setProperty("nodeId", parts[0]);
                Node dstNode = neo4jGraph.createNode();
                dstNode.setProperty("nodeId", parts[1]);
                Relationship relationship = srcNode.createRelationshipTo(dstNode, RelTypes.SIMILAR);
            }
            lineCounter++;
        }
        tx.success();
        reader.close();
    }
    catch (IOException e) {
        e.printStackTrace();
    }
    finally {
        tx.finish();
    }
    long endTime = System.nanoTime();
    duration = endTime - startTime;
    return duration;
}
EDIT: I tried the BatchGraph solution and it seems to run forever. It ran the whole night yesterday and never finished; I had to stop it. Is there anything wrong in my code?
TitanGraph graph = TitanFactory.open("data/titan");
BatchGraph<TitanGraph> batchGraph = new BatchGraph<TitanGraph>(graph, VertexIDType.STRING, 1000);
try {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("data/flickrEdges.txt")));
    String line;
    int lineCounter = 1;
    while ((line = reader.readLine()) != null) {
        if (lineCounter > 4) {
            String[] parts = line.split(" ");
            Vertex srcVertex = batchGraph.getVertex(parts[0]);
            if (srcVertex == null) {
                srcVertex = batchGraph.addVertex(parts[0]);
            }
            Vertex dstVertex = batchGraph.getVertex(parts[1]);
            if (dstVertex == null) {
                dstVertex = batchGraph.addVertex(parts[1]);
            }
            Edge edge = batchGraph.addEdge(null, srcVertex, dstVertex, "similar");
            batchGraph.commit();
        }
        lineCounter++;
    }
    reader.close();
}
catch (IOException ioe) {
    ioe.printStackTrace();
}
Upvotes: 0
Views: 817
Reputation: 9060
With OrientDB you can optimize this import in two ways: skip transactions entirely, and create each vertex with its property in a single call.
So open the graph using OrientGraphNoTx instead of OrientGraph, then try this snippet:
OrientVertex srcVertex = orientGraph.addVertex(null, "nodeId", parts[0] );
OrientVertex dstVertex = orientGraph.addVertex(null, "nodeId", parts[1] );
Edge edge = orientGraph.addEdge(null, srcVertex, dstVertex, "similar");
Without calling .commit().
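Putting the two optimizations together, the whole loader might look roughly like the sketch below. It is only an illustration: the plocal database URL, the class name, and the 4-line header skip are assumptions carried over from the question's code, and the small parseEdge helper is mine, added to keep the parsing step separate.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;
import com.tinkerpop.blueprints.impls.orient.OrientVertex;

public class OrientNoTxLoader {

    // Pure helper so the parsing step can be checked on its own.
    static String[] parseEdge(String line) {
        return line.split(" ");
    }

    public static long load(String datasetRoot) throws IOException {
        // Non-transactional graph: every operation is applied immediately,
        // so nothing accumulates in memory waiting for a commit.
        OrientGraphNoTx orientGraph = new OrientGraphNoTx("plocal:data/orientdb");
        long startTime = System.nanoTime();
        try (BufferedReader reader = new BufferedReader(new FileReader(datasetRoot))) {
            String line;
            int lineCounter = 1;
            while ((line = reader.readLine()) != null) {
                if (lineCounter > 4) { // skip the header lines of the SNAP file
                    String[] parts = parseEdge(line);
                    // vertex plus property in one call, as suggested above
                    OrientVertex srcVertex = orientGraph.addVertex(null, "nodeId", parts[0]);
                    OrientVertex dstVertex = orientGraph.addVertex(null, "nodeId", parts[1]);
                    orientGraph.addEdge(null, srcVertex, dstVertex, "similar");
                    // no commit(): OrientGraphNoTx writes directly
                }
                lineCounter++;
            }
        } finally {
            orientGraph.shutdown();
        }
        return System.nanoTime() - startTime;
    }
}
```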
Upvotes: 1
Reputation: 46216
As you are trying to compare multiple databases, I would recommend generalizing your code to Blueprints. The Flickr dataset looks like the right size for something like the BatchGraph Graph wrapper. With BatchGraph you can tune your commit sizes and focus on the code that manages the loading. That way, you can have one simple class to load all the different graphs (and you could easily extend your test to other Blueprints-enabled graphs).
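A generalized loader along those lines might look like the following sketch. It assumes the Blueprints 2.x API and accepts any TransactionalGraph (Titan, OrientDB, or Neo4j through their Blueprints implementations); the buffer size of 10000 and the class and helper names are my own choices, not anything from the question.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import com.tinkerpop.blueprints.TransactionalGraph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.util.wrappers.batch.BatchGraph;
import com.tinkerpop.blueprints.util.wrappers.batch.VertexIDType;

public class BlueprintsLoader {

    // Pure helper: the SNAP edge file starts with 4 comment lines.
    static boolean isHeader(int lineCounter) {
        return lineCounter <= 4;
    }

    /** Loads an edge list into any Blueprints TransactionalGraph. */
    public static long load(String datasetRoot, TransactionalGraph graph) throws IOException {
        long startTime = System.nanoTime();
        // BatchGraph commits automatically every 10000 mutations and caches
        // vertex IDs, so we neither call commit() nor re-add known vertices.
        BatchGraph<TransactionalGraph> batchGraph =
                new BatchGraph<TransactionalGraph>(graph, VertexIDType.STRING, 10000);
        try (BufferedReader reader = new BufferedReader(new FileReader(datasetRoot))) {
            String line;
            int lineCounter = 1;
            while ((line = reader.readLine()) != null) {
                if (!isHeader(lineCounter)) {
                    String[] parts = line.split(" ");
                    Vertex src = batchGraph.getVertex(parts[0]);
                    if (src == null) src = batchGraph.addVertex(parts[0]);
                    Vertex dst = batchGraph.getVertex(parts[1]);
                    if (dst == null) dst = batchGraph.addVertex(parts[1]);
                    batchGraph.addEdge(null, src, dst, "similar");
                }
                lineCounter++;
            }
        }
        batchGraph.shutdown(); // flushes the final partial batch
        return System.nanoTime() - startTime;
    }
}
```

Note that, unlike the code in the question's EDIT, this never calls commit() by hand; letting BatchGraph flush on its own buffer boundary is the whole point of the wrapper.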
@Stefan makes a good point about memory: you likely need to boost the -Xmx setting on the JVM to deal with that data. Each graph handles memory differently (even though they all persist to disk), and if you are loading all three at once in the same JVM, I would bet there is contention somewhere.
If you plan to go bigger than the Flickr dataset you referenced, then BatchGraph might not be right. BatchGraph is generally good up to a few hundred million graph elements. When you start talking about graphs larger than that, you might want to forget some of what I said about trying to stay graph-agnostic, and instead use the best tool for the job for each graph you want to test: for Neo4j that means Neo4jBatchGraph (at least that way you are still using Blueprints, if that's important to you); for Titan, Faunus or a custom-written parallel batch loader; and for OrientDB, OrientBatchGraph.
Upvotes: 1
Reputation: 39915
This answer just covers the Neo4j part.
You're basically running the full import in a single transaction. A transaction is built up in memory and only committed to disk at the end. Depending on the size of the data to be imported, this might be the reason for the OutOfMemoryError. To handle this, I see 3 options:
1) use the Neo4j batch inserter. This is a non-transactional way to build up a Neo4j datastore. Since the other two snippets above don't use transactions, I guess the batch inserter is the best way to produce comparable results.
2) adapt the memory parameters of your JVM (e.g. -Xmx)
3) split up the transaction. A typically good choice is to bundle 10k - 100k atomic operations into one transaction.
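Option 3 could look roughly like the sketch below, which restarts the transaction every 10k created edges. This is only an illustration against the older embedded API used in the question (hence tx.finish()); the batch size and the class name are assumptions you would tune.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class Neo4jChunkedLoader {

    enum RelTypes implements RelationshipType { SIMILAR }

    static final int BATCH_SIZE = 10000;

    // Pure helper: true whenever a full batch of edges has just been created.
    static boolean batchFull(int edgesCreated) {
        return edgesCreated > 0 && edgesCreated % BATCH_SIZE == 0;
    }

    public static void load(String datasetRoot, GraphDatabaseService db) throws IOException {
        Transaction tx = db.beginTx();
        int edgesCreated = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(datasetRoot))) {
            String line;
            int lineCounter = 1;
            while ((line = reader.readLine()) != null) {
                if (lineCounter > 4) { // skip the SNAP header lines
                    String[] parts = line.split(" ");
                    Node src = db.createNode();
                    src.setProperty("nodeId", parts[0]);
                    Node dst = db.createNode();
                    dst.setProperty("nodeId", parts[1]);
                    src.createRelationshipTo(dst, RelTypes.SIMILAR);
                    edgesCreated++;
                    if (batchFull(edgesCreated)) {
                        // close the current transaction and open a fresh one, so
                        // at most BATCH_SIZE operations are held in memory at once
                        tx.success();
                        tx.finish();
                        tx = db.beginTx();
                    }
                }
                lineCounter++;
            }
            tx.success();
        } finally {
            tx.finish();
        }
    }
}
```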
Side note: take a look at https://github.com/jexp/batch-import, which allows you to run an import directly from CSV files without any Java coding.
Upvotes: 2