Reputation: 9369
I have 130 million nodes with label Snp. I want to convert the property position from string to int for all nodes. I'm using Neo4j 3.0.4 with APOC version 3.0.4.1.
Due to the large number of nodes this has to be done in batches. I tried the apoc.periodic.rock_n_roll() procedure for this:
CALL apoc.periodic.rock_n_roll(
'MATCH (n:Snp) WITH n RETURN id(n) AS id_n',
'MATCH (n:Snp) WHERE id(n)={id_n} SET n.position = toInt(n.position)',
20000
)
I thought this would match all nodes in batches and then call the second query for each batch. Instead it blocks Neo4j with frequent GC pauses and growing memory usage, and the procedure had not finished after 3 hours.
It works if the first MATCH is limited; the following takes ~20 seconds:
CALL apoc.periodic.rock_n_roll(
'MATCH (n:Snp) WITH n LIMIT 1000000 RETURN id(n) AS id_n',
'MATCH (n:Snp) WHERE id(n)={id_n} SET n.position = toInt(n.position)',
20000
)
However, I don't think that is how the procedure is meant to be used. Can I use it differently to convert a property on a large set of nodes?
Upvotes: 3
Views: 878
Reputation: 67009
The first Cypher statement you pass to the apoc.periodic.rock_n_roll procedure will attempt to get all 130 million Snp nodes at once, which is probably why you are seeing high memory usage and slow processing. Batch processing is only applied to the second Cypher statement.
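As an aside, later APOC releases deprecate rock_n_roll in favor of apoc.periodic.iterate, which drives the batching from the first statement itself. If your APOC build includes it (you can check with CALL apoc.help('iterate')), a minimal sketch of the same conversion might look like the following; the batchSize and parallel values here are assumptions, not tested settings:
CALL apoc.periodic.iterate(
  // driving statement: returns the nodes to process
  'MATCH (n:Snp) RETURN n',
  // update statement: applied to each batch of nodes
  'SET n.position = toInt(n.position)',
  {batchSize: 20000, parallel: false}
)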
The apoc.periodic.commit procedure should work better for your use case. The following call will get and convert 100K nodes at a time until all of them have been processed.
CALL apoc.periodic.commit(
'MATCH (n:Snp) WHERE TOINT(n.position) <> n.position WITH n LIMIT {limit} SET n.position = TOINT(n.position) RETURN COUNT(*);',
{limit: 100000}
)
The apoc.periodic.commit procedure repeatedly invokes its Cypher query until it returns 0. The WHERE clause filters out nodes that already have an integer position, and the limit parameter specifies the batch size.
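If you want to watch the conversion progress, the filter from the batched query can be reused on its own; this count should fall toward 0 as batches commit:
MATCH (n:Snp)
WHERE TOINT(n.position) <> n.position
RETURN COUNT(*);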
Upvotes: 8