Oleg Mirzov

Reputation: 155

Multi-threaded node creation in Neo4j

I created 1 million Neo4j nodes in batches of 10,000, each batch in its own transaction. Strangely, running the batches on multiple threads did not improve performance at all. It is as if the transactions in different threads are blocking each other.

Here's a piece of Scala code that tests this with the help of parallel collections:

import org.neo4j.kernel.EmbeddedGraphDatabase

object Main extends App {

    val total = 1000000
    val batchSize = 10000

    val db = new EmbeddedGraphDatabase("neo4yay")

    Runtime.getRuntime().addShutdownHook(
        new Thread(){override def run() = db.shutdown()}
    )

    (1 to total).grouped(batchSize).toSeq.par.foreach(batch => {

        println("thread %s, nodes from %d to %d"
            .format(Thread.currentThread().getId, batch.head, batch.last))

        val transaction = db.beginTx()
        try{
            batch.foreach(db.createNode().setProperty("Number", _))
            // mark the transaction as successful; otherwise finish() rolls it back
            transaction.success()
        }finally{
            transaction.finish()
        }
    })
}

And here are the build.sbt lines needed to build and run it:

scalaVersion := "2.9.2"

libraryDependencies += "org.neo4j" % "neo4j-kernel" % "1.8.M07"

fork in run := true

One can switch between parallel and sequential modes by adding or removing the .par invocation before the outer foreach. The console output clearly shows that with .par the execution is indeed multi-threaded.

To rule out possible problems with concurrency in this code, I have also tried an actor-based implementation, with about the same result (6 and 7 seconds for sequential and parallel versions, respectively).

So, the question is: did I do something wrong, or is this a Neo4j limitation? Thanks!

Upvotes: 4

Views: 1460

Answers (2)

Michael Hunger

Reputation: 41706

The main issue is that your transactions arrive at about the same time, and transaction commits are serialized writes to the transaction log. If the writes were interleaved in time and the actual node creation were a more expensive process, you would get a speedup.
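To illustrate the effect, here is a minimal simulation that does not use Neo4j at all. It assumes each batch consists of a parallelizable part followed by a commit that only one thread can perform at a time. The lock, the method names, and the millisecond costs are all made up for illustration; they are not measurements of Neo4j.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.locks.ReentrantLock

object SerializedCommitSketch {
  // Hypothetical stand-in for the transaction log: one commit at a time.
  private val logLock = new ReentrantLock()

  // One simulated batch; the costs are invented numbers, not Neo4j timings.
  def runBatch(workMs: Long, commitMs: Long): Unit = {
    Thread.sleep(workMs)            // parallelizable part (creating the nodes)
    logLock.lock()
    try Thread.sleep(commitMs)      // serialized part (writing the log)
    finally logLock.unlock()
  }

  // Runs `batches` batches on `threads` threads and returns elapsed milliseconds.
  def elapsedMs(threads: Int, batches: Int, workMs: Long, commitMs: Long): Long = {
    val pool = Executors.newFixedThreadPool(threads)
    val start = System.nanoTime()
    (1 to batches).foreach { _ =>
      pool.execute(new Runnable { def run(): Unit = runBatch(workMs, commitMs) })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    (System.nanoTime() - start) / 1000000
  }

  def main(args: Array[String]): Unit = {
    // Cheap work, expensive commit: extra threads mostly wait for the lock.
    println("commit-heavy, 1 thread:  " + elapsedMs(1, 8, 1, 20) + " ms")
    println("commit-heavy, 4 threads: " + elapsedMs(4, 8, 1, 20) + " ms")
    // Expensive work, cheap commit: extra threads pay off.
    println("work-heavy, 1 thread:    " + elapsedMs(1, 8, 20, 1) + " ms")
    println("work-heavy, 4 threads:   " + elapsedMs(4, 8, 20, 1) + " ms")
  }
}
```

In the commit-heavy case the total time cannot drop below the sum of the commit times, however many threads are used, which is roughly the situation in your benchmark.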

Upvotes: 4

Jan

Reputation: 1777

Batch insertion does not work with multiple threads. From the Neo4j documentation:

Always perform batch insertion in a single thread (or use synchronization to make only one thread at a time access the batch inserter) and invoke shutdown when finished.

Neo4j Batch insert
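A rough sketch of the synchronization option mentioned in the quote, assuming a non-thread-safe inserter. UnsafeInserter is a hypothetical stand-in written for this example, not Neo4j's batch inserter API; the point is only that the data can be prepared in parallel while the inserter itself is accessed by one thread at a time.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical stand-in for a non-thread-safe inserter; not Neo4j's API.
class UnsafeInserter {
  private var count = 0L
  def insert(props: Map[String, Any]): Unit = { count += 1 }
  def inserted: Long = count
}

object SynchronizedInserterSketch {
  // Inserts `total` rows in batches and returns how many rows were inserted.
  def insertAll(total: Int, batchSize: Int, threads: Int): Long = {
    val inserter = new UnsafeInserter
    val pool = Executors.newFixedThreadPool(threads)
    (1 to total).grouped(batchSize).foreach { batch =>
      pool.execute(new Runnable {
        def run(): Unit = {
          // Prepare the properties outside the lock, so this part runs in parallel.
          val rows = batch.map(n => Map[String, Any]("Number" -> n))
          // Only one thread at a time touches the inserter.
          inserter.synchronized { rows.foreach(inserter.insert) }
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    inserter.synchronized { inserter.inserted }
  }

  def main(args: Array[String]): Unit =
    println(insertAll(100000, 10000, 4))
}
```

Since the insert calls are still serialized, this only helps when preparing the data is the expensive part.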

Upvotes: 2
