user1870400
user1870400

Reputation: 6364

creating 200K relationships to a node is taking a lot of time in Neo4J 3.5?

I have one vertex like this

Vertex1

{
    name:'hello',
    id: '2',
    key: '12345',
    col1: 'value1',
    col2: 'value2',
    .......
}

Vertex2, Vertex3, ..... Vertex200K

{
    name:'hello',
    id: '1',
    key: '12345',
    col1: 'value1',
    col2: 'value2',
    .......
}

Cypher Query

MATCH (a:Dense1) where a.id <> "1" 
WITH a 
MATCH (b:Dense1) where b.id = "1" 
WITH a,b 
WHERE a.key = b.key 
MERGE (a)-[:PARENT_OF]->(b)

The end result should be Vertex1 should have a degree of 200K, therefore, there should be 200K relationships. However, the above query takes a lot of time pretty much killing the throughput to 500/second. Any ideas on how to create relationships/edges quicker?

When I run the profile and the cypher query above it keeps running forever and doesn't return so I reduced the size from 200K to 20K and here is what the profile is showing me.

enter image description here

Upvotes: 1

Views: 458

Answers (3)

Lili
Lili

Reputation: 843

Can you try running

MATCH (b:Dense1) where b.id <> "1" 
WITH b, b.key AS bKey
MATCH (a:Dense1) where a.id = "1" AND a.key = bKey 
MERGE (a)-[:PARENT_OF]->(b)

after ensuring you have indexes on id and key ?

Also, do I get this right that id is NOT unique, and you have 1 node with id=2 and 200k with id = 1? If I got this wrong flip the condition to make the first line return single node, one you want to have all relations coming into, and second part matching all the remaining 200k nodes. Also, in the merge, put the low-density node as the first one (so here, b would get 200k relationships in) - if that's not right, reverse it to be (b) <-[:XXX]-(a).

It's been a while since I was dealing with large imports/merges, but I recall that extracting the variable explicitly (e.g. bKey) that can then be matched in index, and starting from single nodes (single, or just a few b's) moving onto higher (multiple a's) was working better than queries with where clauses similar to a.key = b.key.

Having said that, 200k relationships in one transaction, AND connected to single node, is a lot, since just matching on the index finds the nodes quickly, but still you need to validate all outgoing relationships to see if by chance they already link to the other node. So, by the time you create your last relationship, you need to iterate/check nearly 200k relationships.

One trick is running batches in a loop until nothing gets created, e.g.

MATCH (b:Dense1) where b.id = "1" 
WITH b, b.key AS bKey
MATCH (a:Dense1) where a.id <> "1" AND a.key = bKey 
AND NOT (a) -[:PARENT_OF]-> (b) WITH a,b LIMIT 10000
MERGE (a)-[:PARENT_OF]->(b)

This might show you probably that the further the batch, the longer it takes - makes sense logically, as more and more relationships out of b need to be checked the further you go.

Or, as shown in other responses, batch via APOC.

Last thing - is this supposed to be ongoing process or one-time setup / initialisation of the DB? There are more, dramatically faster options if it's for initial load only.

Upvotes: 0

InverseFalcon
InverseFalcon

Reputation: 30397

Given your memory constraints, and the high db hits associated with your MERGE of the relationships, the issue is likely that you're trying to MERGE 200k relationships in a single transaction. You should probably batch this by using apoc.periodic.iterate() from APOC Procedures:

CALL apoc.periodic.iterate("
MATCH (a:Dense1) 
WHERE a.id <> '1'
MATCH (b:Dense1) 
WHERE b.id = '1' AND a.key = b.key 
RETURN a, b",
"MERGE (a)-[:PARENT_OF]->(b)",
{}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages 

This should batch those merges 10k at a time.

Also, if you happen to know for a fact that those relationships don't yet exist, use CREATE instead of MERGE, it will be faster.

Upvotes: 1

Rajendra Kadam
Rajendra Kadam

Reputation: 4052

Create an Index on the properties you are using for matching.

Here id and key properties.

You can create an index with the following queries:

CREATE INDEX ON :Schema1(id);
CREATE INDEX ON :Schema1(key);

This is the first step to improve performance. You can further improve with a few other tricks.

Upvotes: 0

Related Questions