Benoit Verhaeghe

Reputation: 666

Importing JSON into Neo4j

[PROBLEM - My final solution below]

I'd like to import a JSON file containing my data into Neo4j. However, the import is very slow.

The JSON file is structured as follows:

{
    "graph": {
        "nodes": [
            { "id": 3510982, "labels": ["XXX"], "properties": { ... } },
            { "id": 3510983, "labels": ["XYY"], "properties": { ... } },
            { "id": 3510984, "labels": ["XZZ"], "properties": { ... } },
            ...
        ],
        "relationships": [
            { "type": "bla", "startNode": 3510983, "endNode": 3510982, "properties": {} },
            { "type": "bla", "startNode": 3510984, "endNode": 3510982, "properties": {} },
            ...
        ]
    }
}

It is similar to the one proposed in: How can I restore data from a previous result in the browser?

Looking at the answer there, I discovered that I can use:

CALL apoc.load.json("file:///test.json") YIELD value AS row
WITH row, row.graph.nodes AS nodes
UNWIND nodes AS node
CALL apoc.create.node(node.labels, node.properties) YIELD node AS n
SET n.id = node.id

and then

CALL apoc.load.json("file:///test.json") YIELD value AS row
WITH row
UNWIND row.graph.relationships AS rel
// a is the start node and b the end node, so the relationship keeps the direction given in the file
MATCH (a) WHERE a.id = rel.startNode
MATCH (b) WHERE b.id = rel.endNode
CALL apoc.create.relationship(a, rel.type, rel.properties, b) YIELD rel AS r
RETURN *

(I have to do it in two steps because otherwise relationships are duplicated due to the two UNWINDs.)

But this is very slow because I have a lot of entities, and I suspect the query scans all of them for each relationship.

At the same time, I know that "startNode": 3510983 refers to a node. So the question is: is there any way to speed up the import process by using the ids as an index, or something else?

Note that my nodes have different labels, so I did not find a way to create an index covering all of them, and I suppose such an index would be too large (memory).

[MY SOLUTION - not efficient answer 1]

CALL apoc.load.json('file:///test.json') YIELD value
WITH value.graph.nodes AS nodes, value.graph.relationships AS rels
UNWIND nodes AS n
CALL apoc.create.node(n.labels, apoc.map.setKey(n.properties, 'id', n.id)) YIELD node
WITH rels, COLLECT({id: n.id, node: node, labels:labels(node)}) AS nMap
UNWIND rels AS r
MATCH (w{id:r.startNode})
MATCH (y{id:r.endNode})
CALL apoc.create.relationship(w, r.type, r.properties, y) YIELD rel
RETURN rel

[Final Solution in comment]

Upvotes: 1

Views: 3265

Answers (3)

Benoit Verhaeghe

Reputation: 666

The final solution, which also seems efficient, is the following.

It is inspired by the solution and discussion of this answer https://stackoverflow.com/a/61464839/5257140

Major updates are:

  • I use apoc.map.fromPairs(COLLECT([n.id, node])) AS nMap to create an in-memory dictionary. Note that in this answer we use square brackets (a list of pairs) and not curly brackets (a map)
  • We also added a WHERE clause so that the script still works when the original data has some problems (for example, a relationship pointing to a missing node)
  • The last RETURN could be modified (for performance reasons in Neo4j Desktop or other clients that might display the very long result)

CALL apoc.load.json('file:///test-graph.json') YIELD value
WITH value.nodes AS nodes, value.relationships AS rels
UNWIND nodes AS n
CALL apoc.create.node(n.labels, apoc.map.setKey(n.properties, 'id', n.id)) YIELD node
WITH rels, apoc.map.fromPairs(COLLECT([n.id, node])) AS nMap
UNWIND rels AS r
WITH r, nMap[TOSTRING(r.startNode)] AS startNode, nMap[TOSTRING(r.endNode)] AS endNode
WHERE startNode IS NOT NULL AND endNode IS NOT NULL
CALL apoc.create.relationship(startNode, r.type, r.properties, endNode) YIELD rel
RETURN rel
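
As an illustration of the last point, here is a sketch of the same query with only the final line changed, so that the client receives a single row instead of one row per relationship. The alias `created` is just a name chosen for this example.

```cypher
CALL apoc.load.json('file:///test-graph.json') YIELD value
WITH value.nodes AS nodes, value.relationships AS rels
UNWIND nodes AS n
CALL apoc.create.node(n.labels, apoc.map.setKey(n.properties, 'id', n.id)) YIELD node
WITH rels, apoc.map.fromPairs(COLLECT([n.id, node])) AS nMap
UNWIND rels AS r
WITH r, nMap[TOSTRING(r.startNode)] AS startNode, nMap[TOSTRING(r.endNode)] AS endNode
WHERE startNode IS NOT NULL AND endNode IS NOT NULL
CALL apoc.create.relationship(startNode, r.type, r.properties, endNode) YIELD rel
// Only change: aggregate instead of returning every relationship
RETURN count(rel) AS created
```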

Upvotes: 0

cybersam

Reputation: 67019

[EDITED]

This approach may work more efficiently:

CALL apoc.load.json("file:///test.json") YIELD value
WITH value.graph.nodes AS nodes, value.graph.relationships AS rels
UNWIND nodes AS n
CALL apoc.create.node(n.labels, apoc.map.setKey(n.properties, 'id', n.id)) YIELD node
WITH rels, apoc.map.mergeList(COLLECT({id: n.id, node: node})) AS nMap
UNWIND rels AS r
CALL apoc.create.relationship(nMap[r.startNode], r.type, r.properties, nMap[r.endNode]) YIELD rel
RETURN rel

This query does not use MATCH at all (and does not need indexing), since it just relies on an in-memory mapping from the imported node ids to the created nodes. However, this query could run out of memory if there are a lot of imported nodes.

It also avoids invoking SET by using apoc.map.setKey to add the id property to n.properties.

The 2 UNWINDs do not cause a cartesian product, since this query uses the aggregating function COLLECT (before the second UNWIND) to condense all the preceding rows into one (because the grouping key, rels, is a singleton).
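
The row-condensing effect can be seen in isolation with a small sketch that uses literal lists instead of the JSON file (the names `xs`, `ys` and `collected` are made up for this example):

```cypher
WITH [1, 2, 3] AS xs, ['a', 'b'] AS ys
UNWIND xs AS x                   // 3 rows, one per element of xs
WITH ys, COLLECT(x) AS collected // back to 1 row, because ys is the only grouping key
UNWIND ys AS y                   // 2 rows, one per element of ys
RETURN y, collected
```

This returns two rows, not six. Removing the `WITH ys, COLLECT(x) AS collected` line (and `collected` from the RETURN) would instead give one row per (x, y) combination, which is the duplication seen when two UNWINDs are chained directly.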

Upvotes: 2

David A Stumpf

Reputation: 793

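Have you tried indexing the nodes before loading the JSON? This may not be tenable since you have multiple node labels. But if they are limited in number, you can create a placeholder node, create an index, and then delete the placeholder. After this, run the JSON load.

    // Create a placeholder node so that the label exists
    CREATE (n:YourLabel {indx: 'xxx'});
    // Legacy syntax; on Neo4j 4.x and later use: CREATE INDEX FOR (n:YourLabel) ON (n.indx)
    CREATE INDEX ON :YourLabel(indx);
    // Delete only the placeholder
    MATCH (n:YourLabel {indx: 'xxx'}) DELETE n;

The index will speed up matching or merging.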
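
If there are too many labels to index one by one, a possible workaround is to give every imported node one extra shared label and index the id on that label only. The sketch below is an assumption, not something tested on the asker's data: the label name `Imported` is hypothetical, the index statement uses the Neo4j 4.x+ syntax, and the file layout is the one shown in the question. The three statements are meant to be run separately, since schema changes and data writes cannot share a transaction.

```cypher
// 1. Index the id property on the shared (hypothetical) label.
CREATE INDEX FOR (n:Imported) ON (n.id);

// 2. Create the nodes with their own labels plus the shared one.
CALL apoc.load.json("file:///test.json") YIELD value
UNWIND value.graph.nodes AS n
CALL apoc.create.node(n.labels + ['Imported'], apoc.map.setKey(n.properties, 'id', n.id)) YIELD node
RETURN count(node) AS nodesCreated;

// 3. Create the relationships; each MATCH can now use the index.
CALL apoc.load.json("file:///test.json") YIELD value
UNWIND value.graph.relationships AS r
MATCH (a:Imported {id: r.startNode})
MATCH (b:Imported {id: r.endNode})
CALL apoc.create.relationship(a, r.type, r.properties, b) YIELD rel
RETURN count(rel) AS relationshipsCreated;
```

Because the MATCH patterns name the indexed label, the planner can look each id up in the index instead of scanning every node. The extra label can be removed afterwards if it is not wanted.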

Upvotes: 0
