Reputation: 331
I'm currently building an app that models various geographic features (roads, towns, highways, etc.) in a graph database. The geographic data is all in GeoJSON format.
There is no LOAD JSON function in the Cypher language, so loading JSON files requires passing the fully parsed JavaScript object as a query parameter and using UNWIND to access the arrayed properties and objects when creating nodes. (This guide helped me a lot to get started: Loading JSON in neo4j.) Since GeoJSON is just a spec built on JSON conventions, this load-JSON approach works great for reasonably sized files.
However, geographic data files can be massive. Some of the files I'm trying to import range from 100 features to 200,000 features.
The problem I'm running into is that with these very large files, the query will not MERGE any nodes in the database until it has completely processed the entire file. For large files this often exceeds the 3600-second timeout limit set in Neo4j, so I end up waiting an hour only to find out that I have no new data in my database.
I know that with some data, the current recommendation is to convert it to CSV and then use the optimization of LOAD CSV. However, I don't believe it is easy to condense GeoJSON into CSV.
Is it possible to send the data from a very large JSON/GeoJSON file over in smaller batches so that Neo4j will commit the data intermittently?
To import my data, I built a simple Express app that connects to my Neo4j database via the Bolt protocol (using the official JavaScript driver). My GeoJSON files all have a well-known text (WKT) property for each feature so that I can make use of neo4j-spatial.
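For reference, each file is a standard GeoJSON FeatureCollection, so the parsed object I pass to the driver has roughly this shape (the geometry and WKT values here are just illustrative):

// Illustrative shape of the parsed GeoJSON; real files hold anywhere
// from ~100 to ~200,000 features, each with a wkt property.
var jsonObject = {
    type: "FeatureCollection",
    features: [
        {
            type: "Feature",
            geometry: { type: "LineString", coordinates: [[-73.97, 40.78], [-73.96, 40.77]] },
            properties: { wkt: "LINESTRING (-73.97 40.78, -73.96 40.77)" }
        }
        // ...more features
    ]
};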
Here's an example of the code I would use to import a set of road data:
session.run("WITH {json} as data UNWIND data.features as features MERGE (r:Road {wkt:features.properties.wkt})", {json: jsonObject})
.then(function (result) {
var records = [];
result.records.forEach((value) => {
records.push(value);
});
console.log("query completed");
session.close();
driver.close();
return records;
})
.catch((error) => {
console.log(error);
// Close out the session objects
session.close();
driver.close();
});
As you can see, I'm passing the entire parsed GeoJSON object as a parameter to my Cypher query. Is there a better way to do this with very large files to avoid the timeout issue I'm experiencing?
Upvotes: 6
Views: 1548
Reputation: 1723
This answer might be helpful here: https://stackoverflow.com/a/59617967/1967693
apoc.load.jsonArray() (or, as below, apoc.load.json() with a JSON path) will stream the values of the given JSON file. These can then be used as the data source for batching via apoc.periodic.iterate.
CALL apoc.periodic.iterate(
  "CALL apoc.load.json('https://dummyjson.com/products', '$.features') YIELD value AS features",
  "UNWIND features AS feature MERGE (r:Road {wkt: feature.properties.wkt})",
  {batchSize: 1000, parallel: true}
)
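If APOC is not an option, a similar effect can be had on the client side by splitting the features array into chunks and committing each chunk in its own transaction. Here is a minimal sketch against the setup in the question (the chunk size, connection details, and the already-parsed jsonObject are assumptions):

// Minimal sketch: commit the GeoJSON features in chunks so each chunk
// is its own auto-committed transaction and earlier chunks stay in the
// database even if a later one fails. Chunk size and credentials are assumptions.
var neo4j = require('neo4j-driver').v1;
var driver = neo4j.driver('bolt://localhost', neo4j.auth.basic('neo4j', 'password'));

async function importRoads(jsonObject) {
    var session = driver.session();
    var features = jsonObject.features;
    var chunkSize = 1000;
    try {
        for (var i = 0; i < features.length; i += chunkSize) {
            var chunk = features.slice(i, i + chunkSize);
            // Each run() is a separate auto-commit transaction.
            await session.run(
                "UNWIND {features} AS feature MERGE (r:Road {wkt: feature.properties.wkt})",
                {features: chunk}
            );
            console.log("committed " + (i + chunk.length) + " of " + features.length + " features");
        }
    } finally {
        session.close();
        driver.close();
    }
}

On newer Neo4j versions the {features} parameter syntax would be written $features instead.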
Upvotes: 0