Reputation: 40136
After playing around with toy datasets, this was my first attempt to use data that is relevant for a project at work. In short, after limping to get nearly all of my data into Neo4j, my last query simply stalled. See the screenshot.
Note: I was prototyping my queries by pasting them into the browser tool, but my longer term plan was to keep all of the commands in a .cql
file that I could script on my workstation in order to perform nightly analyses.
To add context to my problem, I am prototyping on my macbook.
The files I am processing (rows/columns). I am not importing every column, it was just easier to keep my current datasets in check.
The only config options that I set manually for neo4j were in neo4j-wrapper.conf
where I set wrapper.java.initmemory
and wrapper.java.maxmemory
to 4096
. I did this after poking around to find similar problems.
I made these changes out of the gate because within the browser, I was getting error messages that the database was disconnected while processing my queries.
Lastly, because my data are work-related, I can't provide test data. I can, however, link to my cypher queries.
Constraint and LOAD CSV .cql file
Any help and advice would be greatly appreciated. I am pretty confident this is user error on my end, but I have definitely hit the road with respect to what my next steps would be.
Upvotes: 1
Views: 589
Reputation: 41676
I would split this one up, into creating nodes once and creating relationships (each) second:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///Users/btibert/Dropbox/Projects/bentley-search-neo4j/data/templates.csv" AS row
WITH row
MATCH (r:Vendor {name:row.vendor})
WITH row, r
MERGE (p:Template {name:row.template_clean})
MERGE (v:Version {version:row.template_ver})
MERGE (p)-[:FROM_VERSION]->(v)
MERGE (p)-[:CREATED_BY]->(r);
As you can clearly see the Eager operation in the plan.
I mean it doesn't matter if you just have a few thousand rows. But if it goes towards many hundred thousand or millions then pulling all data in takes more memory.
+----------------+------------------------------------+------------------------------------------------------------------------------------------------+
| Operator | Identifiers | Other |
+----------------+------------------------------------+------------------------------------------------------------------------------------------------+
| EmptyResult | | |
| UpdateGraph(0) | anon[270], anon[301], p, r, row, v | MergePattern |
| UpdateGraph(1) | anon[270], p, r, row, v | MergePattern |
| UpdateGraph(2) | p, r, row, v | MergeNode; row.template_clean; :Template(name); MergeNode; row.template_ver; :Version(version) |
| Eager | r, row | |
| SchemaIndex | r, row | row.vendor; :Vendor(name) |
| LoadCSV | row | |
+----------------+------------------------------------+------------------------------------------------------------------------------------------------+
I would probably change this into an ON CREATE SET
variant for the non-key properties:
Also if you have multiple rows per student you can use WITH DISTINCT toInt(row.pidm) as pidm, ....
to reduce the number of merges it has to run.
LOAD CSV WITH HEADERS FROM "recs.csv" AS row
WITH row
MERGE (s:Student {pidm:toInt(row.pidm)})
ON CREATE SET s.hash_pidm=toInt(row.hash_pidm), ....;
This one I'd split up into two statements, one for each relationship, otherwise you might get too many matches:
(And you don't need the WITH
s in between)
LOAD CSV WITH HEADERS FROM "...recs.csv" AS row
WITH row
MATCH (s:Student {pidm: toInt(row.pidm)} )
MATCH (v:Vendor {name: row.vendor} )
MATCH (a:Ability {name: row.ability} )
WITH row, s, v, a
MERGE (s)-[:PURCHASED_FROM]->(v)
MERGE (s)-[:HAS_ABILITY]->(a);
would become:
LOAD CSV WITH HEADERS FROM "...recs.csv" AS row
MATCH (s:Student {pidm: toInt(row.pidm)} )
MATCH (v:Vendor {name: row.vendor} )
MERGE (s)-[:PURCHASED_FROM]->(v);
LOAD CSV WITH HEADERS FROM "...recs.csv" AS row
MATCH (s:Student {pidm: toInt(row.pidm)} )
MATCH (a:Ability {name: row.ability} )
MERGE (s)-[:HAS_ABILITY]->(a);
Here I would also create the contacts on themselves. (Again with ON CREATE SET) And do the student relationship in a separate statement:
LOAD CSV WITH HEADERS FROM "....cont.csv" AS row
MERGE (c:Contact {cid:row.cid}) ON CREATE SET ....;
LOAD CSV WITH HEADERS FROM "...cont.csv" AS row
MATCH (s:Student {pidm:toInt(row.pidm)} )
MATCH (c:Contact {cid:row.cid})
MERGE (s)-[:HAS_CONTACT]->(c);
I would also split this one up into two statements:
LOAD CSV WITH HEADERS FROM "...cont.csv" AS row
WITH row WHERE toInt(row.seqnum) = 1
MATCH (s:Student {pidm:toInt(row.pidm)})
MATCH (f:Contact {cid:row.first_cont})
MERGE (s)-[:FIRST]->(f);
LOAD CSV WITH HEADERS FROM "...cont.csv" AS row
WITH row WHERE toInt(row.seqnum) = 1
MATCH (s:Student {pidm:toInt(row.pidm)})
MATCH (l:Contact {cid:row.last_cont})
MERGE (s)-[:LAST]->(l);
Split this one up into E-Mail creation and then later connecting it to the student by msg-id:
LOAD CSV WITH HEADERS FROM "...brm.csv" AS row
MERGE (e:Email {msgid:row.msgid}) ON CREATE SET ... ;
LOAD CSV WITH HEADERS FROM "file:///Users/btibert/Dropbox/Projects/bentley-search-neo4j/data/brm.csv" AS row
MATCH (s:Student {pidm:toInt(row.pidm)})
MATCH (e:Email {msgid:row.msgid})
MERGE (s)-[:WAS_SENT]->(e);
HTH Michael
Upvotes: 2