Munchkin

Reputation: 4956

fast way to execute multiple CREATE statements

I access a Neo4j database via Java, and I want to create 1.3 million nodes. To do that, I build 1.3 million CREATE statements. As I found out, the resulting query is far too long; I can only execute about 100 CREATE statements per request, otherwise the query fails:

// Jersey client classes: com.sun.jersey.api.client.{Client, WebResource, ClientResponse}
// Create the client and resource once and reuse them for every batch.
Client client = Client.create();
WebResource cypher = client.resource(SERVER_ROOT_URI + "cypher");

String query = "";
int nrQueries = 0;

for (HashMap<String, String> entity : entities) {
    nrQueries++;
    query += " CREATE [...] ";

    // Send a batch of 100 CREATE statements per request.
    if (nrQueries % 100 == 0) {
        String request = "{\"query\":\"" + query + "\"}";
        ClientResponse cypherResponse =
                cypher.accept(MediaType.APPLICATION_JSON).post(ClientResponse.class, request);
        cypherResponse.close();
        query = "";
    }
}

// Flush the final partial batch; without this, any leftover statements are dropped.
if (!query.isEmpty()) {
    String request = "{\"query\":\"" + query + "\"}";
    ClientResponse cypherResponse =
            cypher.accept(MediaType.APPLICATION_JSON).post(ClientResponse.class, request);
    cypherResponse.close();
}

Well, since I want to execute 1.3 million statements and can only combine 100 into one request, I still end up with 13,000 requests, which take a long time. Is there a faster way to do this?

Upvotes: 2

Views: 313

Answers (1)

FrobberOfBits

Reputation: 18022

You have two other options you should be considering: the import tool and the LOAD CSV option.

The right question here is "how do I get data into Neo4j fast?" rather than "how do I execute a lot of CREATE statements quickly?". Both of those options will be far faster than issuing individual CREATE statements, so I wouldn't mess with individual CREATEs anymore.

Michael Hunger wrote a great blog post describing multiple facets of importing data into Neo4j; check it out if you want to understand not just that these are good options, but why.

The LOAD CSV option does exactly what the name suggests: you use the Cypher query language to load data directly from files, and it goes substantially faster because the records are committed in batches (the documentation describes this). So you're still using transactions to get your data in; you're just doing it faster, in batches, and while still being able to create complex relationships along the way.
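A minimal sketch of what that can look like (the file name, label, property, and batch size here are all hypothetical, just to show the shape of the query):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///entities.csv" AS row
CREATE (:Entity { name: row.name })

Each batch of 1,000 rows is committed as its own transaction, and the whole load runs as a single statement on the server instead of thousands of separate HTTP requests.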

The import tool is similar, except that it's built for very high-performance creation of large volumes of data. The magic here (and why it's so fast) is that it skips the transaction layer entirely. That's both a good thing and a bad thing, depending on your perspective (Michael Hunger's blog post, I believe, explains the trade-offs).
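For reference, it's a command-line invocation along these lines (a sketch only: the paths and file names are made up, and the exact tool name and flags vary by Neo4j version; in 2.2 it ships as neo4j-import):

neo4j-import --into /path/to/graph.db --nodes nodes.csv --relationships rels.csv

The CSV headers tell the tool which columns are node IDs, properties, labels, and relationship endpoints, so the files need a bit more structure than they do with LOAD CSV.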

Without knowing your data, it's hard to make a specific recommendation, but as a general rule I'd start with LOAD CSV as the default, and move to the import tool only if the volume of data is really big or your insert-performance requirements are really intense. This reflects a slight bias on my part that transactions are a good thing, and that staying at the Cypher layer (rather than using a separate command-line tool) is also a good thing, but YMMV.

Upvotes: 1
