Reputation: 6364
I have one vertex like this
Vertex1
{
name:'hello',
id: '2',
key: '12345',
col1: 'value1',
col2: 'value2',
.......
}
Vertex2, Vertex3, ..... Vertex200K
{
name:'hello',
id: '1',
key: '12345',
col1: 'value1',
col2: 'value2',
.......
}
Cypher Query
MATCH (a:Dense1) where a.id <> "1"
WITH a
MATCH (b:Dense1) where b.id = "1"
WITH a,b
WHERE a.key = b.key
MERGE (a)-[:PARENT_OF]->(b)
The end result is that Vertex1 should have a degree of 200K, i.e. there should be 200K relationships. However, the above query takes a very long time, pretty much killing throughput down to 500/second. Any ideas on how to create the relationships/edges quicker?
When I run PROFILE on the Cypher query above it keeps running forever and doesn't return, so I reduced the size from 200K to 20K, and here is what the profile shows me.
Upvotes: 1
Views: 458
Reputation: 843
Can you try running
MATCH (b:Dense1) where b.id <> "1"
WITH b, b.key AS bKey
MATCH (a:Dense1) where a.id = "1" AND a.key = bKey
MERGE (a)-[:PARENT_OF]->(b)
after ensuring you have indexes on id and key?
Also, do I get this right that id is NOT unique, and you have 1 node with id = 2 and 200k nodes with id = 1? If I got this wrong, flip the conditions so that the first MATCH returns the single node, the one you want all the relationships coming into, and the second MATCH returns the remaining 200k nodes. Also, in the MERGE, put the low-density node first (so here, b would get 200k relationships coming in); if that's not right, reverse it to (b) <-[:XXX]-(a).
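For example, if the distribution is actually the other way round (one node with id = 1 and 200k nodes with id = 2), a minimal sketch of the flipped version could look like this (purely illustrative, same idea of matching the single node first):
MATCH (b:Dense1) WHERE b.id = "1"
WITH b, b.key AS bKey
MATCH (a:Dense1) WHERE a.id <> "1" AND a.key = bKey
MERGE (a)-[:PARENT_OF]->(b)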
It's been a while since I dealt with large imports/merges, but I recall that explicitly extracting the property into a variable (e.g. bKey) so it can be matched against the index, and starting from a single node (one b, or just a few) before moving on to the many a's, worked better than queries with WHERE clauses along the lines of a.key = b.key.
Having said that, 200k relationships in one transaction, AND all connected to a single node, is a lot: matching on the index finds the nodes quickly, but you still need to check all the existing outgoing relationships to see whether any of them already links to the other node. So by the time you create your last relationship, you are iterating/checking nearly 200k relationships.
One trick is running batches in a loop until nothing gets created, e.g.
MATCH (b:Dense1) where b.id = "1"
WITH b, b.key AS bKey
MATCH (a:Dense1) where a.id <> "1" AND a.key = bKey
AND NOT (a) -[:PARENT_OF]-> (b) WITH a,b LIMIT 10000
MERGE (a)-[:PARENT_OF]->(b)
This will probably show you that the later the batch, the longer it takes, which makes sense logically: more and more relationships out of b need to be checked the further you go.
Or, as shown in other responses, batch via APOC.
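If you don't want to re-run the batched statement by hand, apoc.periodic.commit is one way to automate that loop: it re-runs a statement containing a LIMIT until the returned count drops to zero (sketch only, assuming APOC is installed; the batch size is illustrative):
CALL apoc.periodic.commit("
  MATCH (b:Dense1) WHERE b.id = '1'
  WITH b, b.key AS bKey
  MATCH (a:Dense1) WHERE a.id <> '1' AND a.key = bKey
  AND NOT (a)-[:PARENT_OF]->(b)
  WITH a, b LIMIT $limit
  MERGE (a)-[:PARENT_OF]->(b)
  RETURN count(*)",
  {limit: 10000})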
Last thing: is this supposed to be an ongoing process or a one-time setup/initialisation of the DB? There are other, dramatically faster options if it's for the initial load only.
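One example of such an option, if the source data is in a file: create the child nodes and their relationships in one pass with LOAD CSV and periodic commits, using CREATE rather than MERGE (a rough sketch only; the file name and columns are made up, and the parent node is assumed to exist already):
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///children.csv' AS row
MATCH (parent:Dense1 {id: '2'})
CREATE (child:Dense1 {id: row.id, key: row.key, name: row.name})
CREATE (parent)-[:PARENT_OF]->(child)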
Upvotes: 0
Reputation: 30397
Given your memory constraints, and the high db hits associated with your MERGE of the relationships, the issue is likely that you're trying to MERGE 200k relationships in a single transaction. You should probably batch this by using apoc.periodic.iterate() from APOC Procedures:
CALL apoc.periodic.iterate("
MATCH (a:Dense1)
WHERE a.id <> '1'
MATCH (b:Dense1)
WHERE b.id = '1' AND a.key = b.key
RETURN a, b",
"MERGE (a)-[:PARENT_OF]->(b)",
{}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
This should batch those merges 10k at a time.
Also, if you happen to know for a fact that those relationships don't yet exist, use CREATE instead of MERGE; it will be faster.
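For example, the same call with CREATE, and with the batch size spelled out explicitly (10000 is already the default), would look like this:
CALL apoc.periodic.iterate("
MATCH (a:Dense1)
WHERE a.id <> '1'
MATCH (b:Dense1)
WHERE b.id = '1' AND a.key = b.key
RETURN a, b",
"CREATE (a)-[:PARENT_OF]->(b)",
{batchSize: 10000}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages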
Upvotes: 1
Reputation: 4052
Create an index on the properties you are using for matching, here the id and key properties.
You can create an index with the following queries:
CREATE INDEX ON :Dense1(id);
CREATE INDEX ON :Dense1(key);
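Once the indexes are online, you can confirm a lookup actually uses them by prefixing the query with PROFILE and checking for a NodeIndexSeek operator, for example:
PROFILE
MATCH (n:Dense1)
WHERE n.id = '1'
RETURN count(n);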
This is the first step to improve performance. You can further improve with a few other tricks.
Upvotes: 0