David
David

Reputation: 16120

How to bulk insert relationships

I'm playing with Neo4j. I have a database with around 400,000 nodes. I would like to insert relationships from a CSV file. There are about 1.4 million relationships.

I'm using the REST API at present. The REST requests look like this example:

POST http://localhost:7474/db/data/cypher
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"query": "MATCH (a { ConceptId: '280844000' }), (b { ConceptId: '71737002' }) CREATE (a)-[:Is_a]->(b) RETURN a"}

The problem is that each request is taking a couple of seconds. This is too slow for the amount of relationships I'm hoping to insert.

I don't have access to the underlying node IDs, just the properties I gave them when I inserted them.

Is there a faster way of doing this?

NB: I'm not using indexes at present (I haven't worked out how to add them), but will try again with indexes tomorrow. I just want to know whether there is a way of inserting relationships in bulk somehow.

Upvotes: 3

Views: 1628

Answers (2)

jjaderberg
jjaderberg

Reputation: 9952

The first improvement is probably to assign labels to your nodes so that you can use indices. Without an index on conceptId, each time your query is executed it will scan the 400,000 nodes twice, once for each of the two nodes you are matching. Speculating based on your query you could give your nodes the label :Concept and index the conceptId property as follows

MATCH (n)
// WHERE HAS (n.conceptId) //if you have some nodes that don't represent concepts, and conceptId distinguishes the ones that do from others
SET n:Concept

then for the index

CREATE INDEX ON :Concept(conceptId)

or if conceptId is a uniquely identifying value you can use a constraint instead

CREATE CONSTRAINT ON (c:Concept) ASSERT c.conceptId IS UNIQUE

Once you have set labels and created indices you can use them to quickly look up the nodes you are connecting. All you need to do is to include the label and indexed property in your query. You already use the indexed property, so adding the label your query becomes

MATCH (a:Concept {ConceptId: '280844000'}), (b:Concept {ConceptId: '71737002'}) 
CREATE (a)-[:Is_a]->(b) 
RETURN a

You can read more about schema in the Neo4j documentation.

The second improvement would probably be to use LOAD CSV as @stephenmuss suggests.

If you have queries in the future that are not based on a csv file, there are two more things to consider. The first is to parameterize your queries. Your HTTP call would then look something like this

POST http://localhost:7474/db/data/cypher
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"query": "MATCH (a { ConceptId: {a} }), (b { ConceptId: {b} }) CREATE (a)-[:Is_a]->(b) RETURN a","params":{"a":"280844000","b":"71737002"}}

This allows the execution engine to create the execution plan once, for the first query of that structure. Next time you issue a query with the same structure, the cached execution plan is reused. This will significantly increase performance for repeated queries with the same structure.

Last thing is along the lines of @ulkas comment, to insert in bulk. One reason LOAD CSV is faster is that it performs several operations in one transaction. You can do something similar using the transactional cypher endpoint. You can then execute a few thousand small statements per transaction, which is significantly more performant for operating on the database, and also will reduce overhead over the wire. It is slightly more complicated to design the payload for the transactional endpoint and also to handle exceptions. A simple example below, you can read more about it in the Neo4j manual pages.

POST http://localhost:7474/db/data/transaction
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"statements":[
    {"statement":"MATCH (a:Concept {ConceptId: {a}}), (b:Concept {ConceptId: {b}}) CREATE (a)-[:Is_a]->(b) RETURN a","parameters":{"a":"280844000","b":"71737002"}},
    {"statement":"MATCH (a:Concept {ConceptId: {a}}), (b:Concept {ConceptId: {b}}) CREATE (a)-[:Is_a]->(b) RETURN a","parameters":{"a":"199401294","b":"51233509"}}
]}

The server returns location of the new transaction, say "http://localhost:7474/db/data/transaction/1". You can continue to execute statements within the same transaction

POST http://localhost:7474/db/data/transaction/1
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"statements":[...]}

When you're done you commit. The commit call can also contain statements.

POST http://localhost:7474/db/data/transaction/1/commit
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"statements":[...]}

Upvotes: 9

stephenmuss
stephenmuss

Reputation: 2445

If you are using Neo4j 2.1+ I think your best option would be to use LOAD CSV.

You can then use syntax like the following

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/path/to/file.csv" AS csvLine
MATCH (a{ConceptId: csvLine.aId}), (b{ConceptId: csvLine.bId})
CREATE (a)-[:Is_a]->(b)

I suggest checking out the docs for importing csv files.

Upvotes: 3

Related Questions