Reputation: 16120
I'm playing with Neo4j. I have a database with around 400,000 nodes. I would like to insert relationships from a CSV file. There are about 1.4 million relationships.
I'm using the REST API at present. The REST requests look like this example:
POST http://localhost:7474/db/data/cypher
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"query": "MATCH (a { ConceptId: '280844000' }), (b { ConceptId: '71737002' }) CREATE (a)-[:Is_a]->(b) RETURN a"}
The problem is that each request is taking a couple of seconds. This is too slow for the amount of relationships I'm hoping to insert.
I don't have access to the underlying node IDs, just the properties I gave them when I inserted them.
Is there a faster way of doing this?
NB: I'm not using indexes at present (I haven't worked out how to add them), but will try again with indexes tomorrow. I just want to know whether there is a way of inserting relationships in bulk somehow.
Upvotes: 3
Views: 1628
Reputation: 9952
The first improvement is probably to assign labels to your nodes so that you can use indexes. Without an index on ConceptId, each execution of your query scans the 400,000 nodes twice, once for each of the two nodes you are matching. Speculating based on your query, you could give your nodes the label :Concept and index the ConceptId property as follows
MATCH (n)
// WHERE HAS(n.ConceptId) // uncomment if some of your nodes don't represent concepts and ConceptId distinguishes the ones that do
SET n:Concept
then for the index
CREATE INDEX ON :Concept(ConceptId)
or, if ConceptId is a uniquely identifying value, you can use a constraint instead
CREATE CONSTRAINT ON (c:Concept) ASSERT c.ConceptId IS UNIQUE
Once you have set labels and created indexes, you can use them to look up the nodes you are connecting quickly. All you need to do is include the label and the indexed property in your query. You already use the indexed property, so after adding the label your query becomes
MATCH (a:Concept {ConceptId: '280844000'}), (b:Concept {ConceptId: '71737002'})
CREATE (a)-[:Is_a]->(b)
RETURN a
You can read more about schema in the Neo4j documentation.
The second improvement would probably be to use LOAD CSV, as @stephenmuss suggests.
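Combining the two improvements, a LOAD CSV statement that takes advantage of the :Concept label and the index might look roughly like the sketch below; the file path and the aId/bId column names are placeholders for whatever your CSV actually contains.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/path/to/relationships.csv" AS csvLine
MATCH (a:Concept {ConceptId: csvLine.aId}), (b:Concept {ConceptId: csvLine.bId})
CREATE (a)-[:Is_a]->(b)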
If you have queries in the future that are not based on a CSV file, there are two more things to consider. The first is to parameterize your queries. Your HTTP call would then look something like this
POST http://localhost:7474/db/data/cypher
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"query": "MATCH (a { ConceptId: {a} }), (b { ConceptId: {b} }) CREATE (a)-[:Is_a]->(b) RETURN a","params":{"a":"280844000","b":"71737002"}}
This allows the execution engine to create the execution plan once, for the first query of a given structure; when you later issue a query with the same structure, the cached plan is reused, which significantly improves performance for repeated queries.
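For example, the next relationship would be inserted with exactly the same query string and only the parameter values changed (the IDs here are purely illustrative), so the cached plan is reused:
POST http://localhost:7474/db/data/cypher
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"query": "MATCH (a { ConceptId: {a} }), (b { ConceptId: {b} }) CREATE (a)-[:Is_a]->(b) RETURN a","params":{"a":"199401294","b":"51233509"}}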
The last thing, along the lines of @ulkas's comment, is to insert in bulk. One reason LOAD CSV is faster is that it performs many operations in a single transaction. You can do something similar using the transactional Cypher endpoint: execute a few thousand small statements per transaction, which is significantly more efficient for the database and also reduces overhead over the wire. Designing the payload for the transactional endpoint and handling exceptions is slightly more involved. A simple example is below; you can read more about it in the Neo4j manual pages.
POST http://localhost:7474/db/data/transaction
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"statements":[
{"statement":"MATCH (a:Concept {ConceptId: {a}}), (b:Concept {ConceptId: {b}}) CREATE (a)-[:Is_a]->(b) RETURN a","parameters":{"a":"280844000","b":"71737002"}},
{"statement":"MATCH (a:Concept {ConceptId: {a}}), (b:Concept {ConceptId: {b}}) CREATE (a)-[:Is_a]->(b) RETURN a","parameters":{"a":"199401294","b":"51233509"}}
]}
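The response to this call contains the results of the statements along with a commit link and a transaction expiry time; roughly (the exact fields depend on your Neo4j version) it looks like
{"commit":"http://localhost:7474/db/data/transaction/1/commit","results":[...],"transaction":{"expires":"..."},"errors":[]}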
The server returns the location of the new transaction, say "http://localhost:7474/db/data/transaction/1". You can continue to execute statements within the same transaction
POST http://localhost:7474/db/data/transaction/1
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"statements":[...]}
When you're done, you commit. The commit call can also contain statements.
POST http://localhost:7474/db/data/transaction/1/commit
Accept: application/json; charset=UTF-8
Content-Type: application/json
{"statements":[...]}
Upvotes: 9
Reputation: 2445
If you are using Neo4j 2.1+, I think your best option would be to use LOAD CSV.
You can then use syntax like the following
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/path/to/file.csv" AS csvLine
MATCH (a {ConceptId: csvLine.aId}), (b {ConceptId: csvLine.bId})
CREATE (a)-[:Is_a]->(b)
I suggest checking out the docs on importing CSV files.
Upvotes: 3