Martin Buber

Reputation: 135

How to use LOAD CSV for a large dataset in Neo4j?

I have a user.csv file with students:

id, first_name, last_name, locale, gender
1, Hasso, Plattner, en, male
2, Tina, Turner, de, female

and a memberships.csv file with course memberships of the students:

id, user_id, course_id
1, 1, 3
2, 1, 4
3, 2, 4
4, 2, 5

To transform students and courses into vertices and course memberships into edges, I joined the user information into memberships.csv:

id, user_id, first_name, last_name, course_id, locale, gender
1, 1, Hasso, Plattner, 3, en, male
2, 1, Hasso, Plattner, 4, en, male
3, 2, Tina, Turner, 4, de, female
4, 2, Tina, Turner, 5, de, female

and used LOAD CSV, some constraints, and MERGE:

create constraint on (g:Gender) assert g.gender is unique
create constraint on (l:locale) assert l.locale is unique
create constraint on (c:Course) assert c.course is unique
create constraint on (s:Student) assert s.student is unique

USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM
'file: memberships.csv'
AS line
MERGE (s:Student {id: line.id, name: line.first_name +" "+line.last_name })
MERGE (c:Course {id: line.course_id})
MERGE (g:Gender {gender:line.gender})
MERGE (l:locale {locale:line.locale})
MERGE (s)-[:HAS_GENDER]->(g)
MERGE (s)-[:HAS_LANGUAGE]->(l)
MERGE (s)-[:ENROLLED_IN]->(c)

For 1 000 memberships Neo4j needs 2 seconds to load, for 10 000 memberships it needs 3 minutes, and for 100 000 it fails with 'Unknown error'.

i) How do I get rid of the error? ii) Is there a more elegant way to load such a structure from .csv with about 600 000 memberships?

I am using a local machine with a 2.4 GHz CPU and 16 GB RAM.

Upvotes: 2

Views: 446

Answers (2)

Michael Hunger

Reputation: 41676

Try importing the nodes from their own CSV first, and the relationships afterwards.

Also try an import run without the Gender and Locale nodes, storing those values as properties on the student instead.

If you really need those (dense) nodes later on, try to run it like this:

CREATE (g:Gender {gender:"male"})
WITH g
MATCH (s:Student {gender:"male"})
CREATE (s)-[:HAS_GENDER]->(g)

Each student is matched only once, so those relationships will be unique, and CREATE is cheaper than MERGE. I assume that checking 2*(n-1) relationships per inserted student adds up, since the import as a whole is then O(n^2).
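
A minimal sketch of the "nodes first, relationships afterwards" import, written in the same Cypher dialect as the question. Several things here are assumptions rather than facts from the question: the file URLs are placeholders, the CSV headers are assumed to contain no spaces after the commas, students are keyed on the id column of user.csv (the question's query keys them on the membership id), and the batch size of 10000 is arbitrary. The constraints are declared on id because that is the property MERGE looks up; the question's constraints on s.student and c.course do not cover those lookups.

```cypher
// Constraints on the property that MERGE and MATCH actually look up (assumed: id).
CREATE CONSTRAINT ON (s:Student) ASSERT s.id IS UNIQUE;
CREATE CONSTRAINT ON (c:Course) ASSERT c.id IS UNIQUE;

// Pass 1: one Student node per row of user.csv.
// Gender and locale are stored as properties, not as separate nodes.
// The file URL is a placeholder; adjust it to your setup.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///path/to/user.csv' AS line
MERGE (s:Student {id: line.id})
ON CREATE SET s.name = line.first_name + ' ' + line.last_name,
              s.gender = line.gender,
              s.locale = line.locale;

// Pass 2: one Course node per distinct course_id in memberships.csv.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///path/to/memberships.csv' AS line
MERGE (c:Course {id: line.course_id});

// Pass 3: relationships only. Both endpoints already exist,
// so each row needs two indexed lookups and one relationship check.
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///path/to/memberships.csv' AS line
MATCH (s:Student {id: line.user_id})
MATCH (c:Course {id: line.course_id})
MERGE (s)-[:ENROLLED_IN]->(c);
```

If every (user_id, course_id) pair occurs only once in memberships.csv and the import runs against an empty database, the final MERGE could be swapped for CREATE; MERGE is kept here so that a repeated run does not duplicate relationships.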

Upvotes: 0

Kenny Bastani

Reputation: 3308

The Neo4j browser has a 60-second timeout on Cypher queries (due to the HTTP transport). This does not mean that your query failed; in fact, there has been no error at the database level. The query continues to run after the browser times out, but you will not be able to see its result. To watch long-running queries run to completion, use the Neo4j shell.

http://docs.neo4j.org/chunked/stable/shell.html

Upvotes: 0
