Reputation: 884
I am trying to build a simple relationship in Neo4j using the Neo4j Spark connector. My DataFrame looks like this:
df_new = spark.createDataFrame(
    [("CompanyA", "A", "CompanyA", "B"), ("CompanyB", "B", "CompanyB", "C")],
    ["name", "gid", "description", "parent_gid"],
)
The desired tree (reading each row's parent_gid against gid, parent on top) should look like this:

C
└── CompanyB (gid: B)
    └── CompanyA (gid: A)
The query I wrote looks like this:
query = """
MERGE (c:Company {gid:event.gid})
ON CREATE SET c.name=event.name, c.description=event.description
ON MATCH SET c.name=event.name, c.description=event.description
MERGE (p:Company {gid:event.parent_gid})
MERGE (p)-[:PARENT_OF]->(c)
"""
df_new.write \
    .mode("Overwrite") \
    .format("org.neo4j.spark.DataSource") \
    .option("url", "bolt://localhost:7687") \
    .option("authentication.type", "basic") \
    .option("authentication.basic.username", username) \
    .option("authentication.basic.password", password) \
    .option("query", query) \
    .save()
However, my code ends up creating a new node instead of merging with the existing one, and I end up with two nodes for company B.
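A plain Cypher check (an illustrative query, using the gid value from the DataFrame above) shows the duplicate:

MATCH (c:Company {gid: 'B'})
RETURN c.gid, c.name

This returns two rows for gid 'B' where there should be one.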
Upvotes: 1
Views: 801
Reputation: 655
You have exactly the right logic; there's just some nuance at play that is hard to pin down. This article has your answer; read the section near the end about unique constraints: https://neo4j.com/developer/kb/understanding-how-merge-works/
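To make that section concrete: without a constraint, two transactions running in parallel (Spark writes each partition in its own transaction) can both check for gid = 'B', find nothing, and both create it. A uniqueness constraint makes the database enforce what MERGE alone cannot. A sketch, assuming Neo4j 4.4+ syntax and an illustrative constraint name company_gid (older versions use CREATE CONSTRAINT ON (c:Company) ASSERT c.gid IS UNIQUE):

CREATE CONSTRAINT company_gid IF NOT EXISTS
FOR (c:Company) REQUIRE c.gid IS UNIQUE

Create it once, before the Spark job runs; with it in place, one of two racing MERGE transactions will wait on or fail against the other's lock instead of both creating a node.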
One solution is to change your query to this:
query = '''
MERGE (c:Company {gid: event.gid})
SET c.name = event.name, c.description = event.description
MERGE (p:Company {gid: event.parent_gid})
SET p.name = event.name, p.description = event.description
MERGE (p)-[:PARENT_OF]->(c)
'''
Now, when the Spark partitions perform concurrent operations, Cypher has enough uniqueness information (a single MERGE per gid, backed by the unique constraint from the article) to avoid duplicating gid = "B".
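If you'd rather not touch the schema, a blunt workaround (my suggestion, not from the article) is to remove the concurrency itself: write from a single partition so no two transactions can MERGE the same gid at once. You lose write parallelism, which hardly matters for a DataFrame this small:

# coalesce(1) collapses the DataFrame to one partition,
# so the MERGE statements run serially over a single connection
df_new.coalesce(1).write \
    .mode("Overwrite") \
    .format("org.neo4j.spark.DataSource") \
    .option("url", "bolt://localhost:7687") \
    .option("authentication.type", "basic") \
    .option("authentication.basic.username", username) \
    .option("authentication.basic.password", password) \
    .option("query", query) \
    .save()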
Upvotes: 0