Reputation: 499
I have data in the following structure:
{"id": "1", "name": "A. I. Lazarev", "org": "United States Department of State", "tags": [{"t": "Infrared"}, {"t": "Near-infrared spectroscopy"}, {"t": "Infrared astronomy"}, {"t": "Data collection"}], "pubs": [{"i": "1542417502", "r": 6}], }
{"id": "2", "name": "Stevan Spremo", "tags": [{"t": "Micro-g environment"}, {"t": "Antibiotics"}, {"t": "Bacteriology"}], "pubs": [{"i": "222163962", "r": 0}], }
{"id": "3", "name": "Bricchi G", "pubs": [{"i": "2417067698", "r": 1}, {"i": "2406980973", "r": 1}]}
Some of the rows have tags, some have organizations, some have both, and some have neither.
I'd like to add relationships between (1) authors and tags, (2) authors and organizations, and (3) authors and publications. I have the publications as nodes already, so it should be fairly straightforward to get (3) once I get (1) and (2).
I have been trying to use the following code:
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:/test.txt') YIELD value AS q RETURN q",
"UNWIND q.id as id
CREATE (a:Author {id:id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {{name: tags.t}})
CREATE (a)-[:HAS_TAGS]->(t)
WITH q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false})
The tags and the organizations show up multiple times in the data, but should only have one node each, so I have used MERGE
to create unique nodes for these.
The problem with the following code is that it creates duplicate AFFILIATED_WITH
relationships - it actually creates the same number of AFFILIATED_WITH
relationships as there are tags.
How can I change the cypher query so that it isn't creating duplicate relationships?
Upvotes: 0
Views: 178
Reputation: 67019
After this clause:
UNWIND q.tags as tags
your query will have as many data rows as the number of tags for the current q
(each row will have q, a, id, tags
values). The subsequent operations will be performed once per data row. That is why you are creating too many AFFILIATED_WITH
relationships.
To solve your issue, you have to reduce the number of data rows appropriately, at the appropriate time (and this will also speed up your processing, since unnecessarily repeated operations will be avoided). In your case, you can just change the second WITH q, a
clause to WITH DISTINCT q, a
:
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///test.txt') YIELD value AS q RETURN q",
"CREATE (a:Author {id:q.id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {name: tags.t})
CREATE (a)-[:HAS_TAGS]->(t)
WITH DISTINCT q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false}
)
I have also simplified the query by removing the unnecessary UNWIND q.id as id
clause, and fixed some syntax issues.
[UPDATED]
If you want to add the AUTHORED
relationships (as requested in the comments to this answer), you should do that before you create the AFFILIATED_WITH
relationships -- since the WHERE q.org is not null
clause would filter out some q
nodes. Also, whenever you use CREATE
to create a relationship, Cypher requires that you specify a direction for the relationship.
CALL apoc.periodic.iterate(
"CALL apoc.load.json('file:///test.txt') YIELD value AS q RETURN q",
"CREATE (a:Author {id:q.id, name:q.name, citations:q.n_citation, publications:q.n_pubs})
WITH q, a
UNWIND q.tags as tags
MERGE (t:Tag {name: tags.t})
CREATE (a)-[:HAS_TAGS]->(t)
WITH DISTINCT q, a
UNWIND q.pubs as pubs
MERGE (p:Quanta {id: pubs.i})
CREATE (a)-[r:AUTHORED {rank: pubs.r}]->(p)
WITH q, a
WHERE q.org is not null
MERGE (o:Organization {name: q.org})
CREATE (a)-[:AFFILIATED_WITH]->(o)",
{batchSize:10000, iterateList:true, parallel:false}
)
Upvotes: 3