NumenorForLife

Reputation: 1746

Optimizing py2neo's cypher insertion

I am using py2neo to import several hundred thousand nodes. I've created a defaultdict to map neighborhoods to cities. One motivation was to import these relationships more efficiently, having been unsuccessful with Neo4j's load tool.

Because the batch documentation suggests avoiding it, I veered away from an implementation like the OP of this post. Instead, the documentation suggests I use Cypher. However, I like being able to create nodes from the defaultdict I have created. Plus, I found it too difficult to import this information the way the first link demonstrates.

To speed up the import, should I create a Cypher transaction (and submit every 10,000 statements) instead of the following loop?

for city_name, neighborhood_names in city_neighborhood_map.iteritems():
    city_node = graph.find_one(label="City", property_key="Name", property_value=city_name)
    for neighborhood_name in neighborhood_names:
        neighborhood_node = Node("Neighborhood", Name=neighborhood_name)
        rel = Relationship(neighborhood_node, "IN", city_node)
        graph.create(rel)

I get a time-out, and the following appears to be pretty slow. Even when I break the transaction up so it commits every 1,000 neighborhoods, it still processes very slowly.

tx = graph.cypher.begin()
statement = "MERGE (city {Name:{City_Name}}) CREATE (neighborhood { Name : {Neighborhood_Name}}) CREATE (neighborhood)-[:IN]->(city)"
for city_name, neighborhood_names in city_neighborhood_map.iteritems():
    for neighborhood_name in neighborhood_names:
        tx.append(statement, {"City_Name": city_name, "Neighborhood_Name": neighborhood_name})
tx.commit()

It would be fantastic to save pointers to each city so I don't need to look it up each time with the merge.
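The "pointers to each city" idea amounts to caching each city node after its first lookup, so the database is hit once per city rather than once per neighborhood. A minimal sketch of that memoization pattern in plain Python, with a stub `fetch_city` function standing in for `graph.find_one(...)` so the idea is clear without a live database (the names and sample data are illustrative, not from the question):

```python
# Sketch: cache city lookups so each city is fetched only once.
# fetch_city stands in for graph.find_one(...); swap in the real call.

lookup_calls = []

def fetch_city(name):
    lookup_calls.append(name)   # track how often we hit the "database"
    return {"Name": name}       # stand-in for a py2neo Node

city_cache = {}

def get_city(name):
    if name not in city_cache:  # only fetch on a cache miss
        city_cache[name] = fetch_city(name)
    return city_cache[name]

# Illustrative sample data in the same shape as the defaultdict above.
city_neighborhood_map = {"Boston": ["Back Bay", "Fenway"], "Austin": ["Hyde Park"]}

for city_name, neighborhood_names in city_neighborhood_map.items():
    for neighborhood_name in neighborhood_names:
        city_node = get_city(city_name)  # cached after the first hit
        # ... create the Neighborhood node and the IN relationship here ...
```

With this in place, two cities produce exactly two lookups no matter how many neighborhoods are attached to them.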

Upvotes: 0

Views: 809

Answers (1)

Martin Preusse

Reputation: 9369

It may be faster to do this in two runs, i.e. CREATE all nodes first with unique constraints (which should be very fast) and then CREATE the relationships in a second round.

Create the constraints first, using the labels City and Neighborhood; this also makes the later MATCH faster:

graph.schema.create_uniqueness_constraint('City', 'Name')
graph.schema.create_uniqueness_constraint('Neighborhood', 'Name')

Create all nodes:

tx = graph.cypher.begin()

statement = "CREATE (:City {Name: {name}})"
for city_name in city_neighborhood_map.keys():
    tx.append(statement, {"name": city_name})

statement = "CREATE (:Neighborhood {Name: {name}})"
for neighborhood_names in city_neighborhood_map.values():  # all neighborhood lists, city by city
    for neighborhood_name in neighborhood_names:
        tx.append(statement, {"name": neighborhood_name})

tx.commit()

Relationships should be fast now (fast MATCH due to constraints/index):

tx = graph.cypher.begin()
statement = "MATCH (city:City {Name: {City_Name}}), (n:Neighborhood {Name: {Neighborhood_Name}}) CREATE (n)-[:IN]->(city)"
for city_name, neighborhood_names in city_neighborhood_map.iteritems():
    for neighborhood_name in neighborhood_names:
        tx.append(statement, {"City_Name": city_name, "Neighborhood_Name": neighborhood_name})

tx.commit()
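If a single transaction holding hundreds of thousands of appended statements still times out, the same loop can be split into fixed-size batches, committing and reopening the transaction every N statements. A sketch of the batching arithmetic only; the `tx` calls are left as comments since they need a live graph, and the sample data and `BATCH_SIZE` value are illustrative (in practice something like 1,000 per the question):

```python
# Sketch: flatten the map into (city, neighborhood) pairs, then commit in batches.
BATCH_SIZE = 2  # demo value; use e.g. 1000 against a real database

# Illustrative sample data in the same shape as the defaultdict above.
city_neighborhood_map = {"Boston": ["Back Bay", "Fenway"], "Austin": ["Hyde Park"]}

pairs = [(c, n) for c, ns in city_neighborhood_map.items() for n in ns]

# Slice the flat list into chunks of BATCH_SIZE.
batches = [pairs[i:i + BATCH_SIZE] for i in range(0, len(pairs), BATCH_SIZE)]

for batch in batches:
    # tx = graph.cypher.begin()
    for city_name, neighborhood_name in batch:
        pass  # tx.append(statement, {"City_Name": city_name, "Neighborhood_Name": neighborhood_name})
    # tx.commit()
```

Each commit then covers at most BATCH_SIZE relationship creations, keeping individual transactions small enough to finish before the server times out.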

Upvotes: 2
