John Li

Reputation: 55

Building relationships in Neo4j via Py2neo is very slow

We have 5 different types of nodes in the database. The largest type has ~290k nodes; the smallest has only ~3k. Each node type has an id field, and they are all indexed. I am using py2neo to build relationships, but it is very slow (~2 relationships inserted per second).

I used pandas to read from a relationship CSV and iterated over each row, creating one relationship per transaction. I also tried batching 10k creation statements into one transaction, but that did not seem to improve the speed much.

Below is the code:

import pandas as pd
from py2neo import Graph

graph = Graph()  # connects to the local Neo4j instance

df = pd.read_csv(r"C:\relationship.csv", dtype=datatype, skipinitialspace=True, usecols=fields)
df.fillna('', inplace=True)

def f(node_1, rel_type, node_2):
    try:
        tx = graph.begin()
        tx.evaluate('MATCH (a {node_id:$label1}),(b {node_id:$label2}) MERGE (a)-[r:'+rel_type+']->(b)',
                    parameters = {'label1': node_1, 'label2': node_2})
        tx.commit()
    except Exception as e:
        print(str(e))

for index, row in df.iterrows():
    if index % 1000000 == 0:
        print(index)
    try:
        f(row["node_1"], row["rel_type"], row["node_2"])
    except:
        print("error index: " + str(index))

Can someone help me figure out what I did wrong here? Thanks!

Upvotes: 1

Views: 944

Answers (2)

Nigel Small

Reputation: 4495

If your aim is better performance, you will need to consider a different pattern for loading these, i.e. batching. You're currently running one Cypher MERGE statement per relationship, each wrapped in its own transaction in a separate function call.

Batching these by looking at multiple statements per transaction or per function call will reduce the number of network hops and should improve performance.
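As an illustration, here is a minimal sketch of that batching pattern, assuming `graph` is a connected py2neo Graph and `df` is the DataFrame from the question. Relationship types cannot be parameterized in Cypher, so rows are grouped by type first, then sent in chunks via UNWIND so each transaction carries many relationships instead of one (the function name and batch size are my own choices, not from the question):

```python
import pandas as pd

def merge_relationships(graph, df, batch_size=10000):
    # Relationship types cannot be parameterized, so build one
    # query string per distinct rel_type.
    for rel_type, group in df.groupby("rel_type"):
        rows = group[["node_1", "node_2"]].to_dict("records")
        query = (
            'UNWIND $rows AS row '
            'MATCH (a {node_id: row.node_1}), (b {node_id: row.node_2}) '
            'MERGE (a)-[r:' + rel_type + ']->(b)'
        )
        # One transaction per chunk: many relationships per network hop.
        for i in range(0, len(rows), batch_size):
            tx = graph.begin()
            tx.run(query, rows=rows[i:i + batch_size])
            tx.commit()
```

With ~10k rows per transaction, the per-statement and per-transaction network overhead is amortized across the whole batch.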

Upvotes: 0

cybersam

Reputation: 67044

You state that there are "5 different types of nodes" (which I interpret to mean 5 node labels, in neo4j terminology). And, furthermore, you state that their id properties are already indexed.

But your f() function is not generating a Cypher query that uses the labels at all, and neither does it use the id property. In order to take advantage of your indexes, your Cypher query has to specify the node label and the id value.

Since there is currently no efficient way to parameterize the label when performing a MATCH, the following version of the f() function generates a Cypher query that has hardcoded labels (as well as a hardcoded relationship type):

def f(label_1, id_1, rel_type, label_2, id_2):
    try:
        tx = graph.begin()
        tx.evaluate(
                'MATCH' +
                '(a:' + label_1 + '{id:$id1}),' +
                '(b:' + label_2 + '{id:$id2}) ' +
                'MERGE (a)-[r:'+rel_type+']->(b)',
            parameters = {'id1': id_1, 'id2': id_2})
        tx.commit()
    except Exception as e:
        print(str(e))

The code that calls f() will also have to be changed to pass in both the label names and the id values for a and b. Hopefully, your df rows will contain that data (or enough info for you to derive that data).
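For instance, the updated calling loop might look like the sketch below. The label_1 and label_2 column names are assumptions about the CSV layout; substitute whatever columns (or derivation logic) actually hold the labels in your data:

```python
import pandas as pd

def load_relationships(df, f):
    # Pass labels and ids through to the revised f() so the Cypher
    # query can use the indexes. Column names are assumed, not given.
    for index, row in df.iterrows():
        try:
            f(row["label_1"], row["node_1"], row["rel_type"],
              row["label_2"], row["node_2"])
        except Exception as e:
            print("error index: " + str(index) + ": " + str(e))
```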

Upvotes: 1
