mazelife

Reputation: 2089

How to explain the performance of Cypher's LOAD CSV clause?

I'm using Cypher's LOAD CSV syntax in Neo4j 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But in one case I'm running into behavior that's not what I'd expect, and I wonder if I'm missing something.

The Cypher query being used is this:

USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional=(row[3] = 't')
CREATE (s)-[:has]->(ds)

Here are a couple of lines from the CSV:

227303,1,TO-PURPOSE-NOMINAL,t,73830
334471,1,AT-LOCATION,t,92048
334470,1,AT-TIME,t,92048
334469,1,ON-LOCATION,t,92048
227302,1,TO-PURPOSE-INFINITIVE,t,73830
116008,1,TO-LOCATION,t,68204
116007,1,IN-LOCATION,t,68204
227301,1,TO-LOCATION,t,73830
334468,1,ON-DATE,t,92048
116006,1,AT-LOCATION,t,68204
334467,1,WITH-ASSOCIATE,t,92048

Basically, I'm matching a Sense node (previously imported) based on its ID value, which is in the fifth column. Then I'm using MERGE to either get a DependencySet node if it exists, or create it. Finally, I'm creating a has relationship between the Sense node and the DependencySet node. So far so good; this all works as expected. What's confusing is the performance as the size of the CSV grows.
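For reference, here is the same import restated with the columns given names (the names are just for readability here; they aren't part of the file or the original query):

USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
WITH row[2] AS dsLabel, (row[3] = 't') AS dsOptional, toInt(row[4]) AS senseUid
MATCH (s:Sense {uid: senseUid})
MERGE (ds:DependencySet {label: dsLabel}) ON CREATE SET ds.optional = dsOptional
CREATE (s)-[:has]->(ds)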

CSV Lines       Time (msec)
------------------------------
500             480
1000            717
2000            1110
5000            1521
10000           2111
50000           4794
100000          5907
200000          12302
300000          35494
400000          Java heap space error

My expectation is that growth would be more or less linear, particularly as I'm committing every 500 lines as recommended by the manual, but it's actually closer to polynomial:

(Graph of load time by number of CSV lines)

What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for about 5-7 minutes before hitting the heap space error. It seems like I could split this file into 300,000-line chunks, but isn't that what USING PERIODIC COMMIT is supposed to do, more or less? I suppose I could give Neo4j more memory too, but again, it's not clear why I should have to in this scenario.
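(If more heap really is the answer, my understanding is that for Neo4j 2.1 the place to set it would be conf/neo4j-wrapper.conf; the values below are placeholders, not a recommendation:)

# conf/neo4j-wrapper.conf -- placeholder values
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096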

Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:

Indexes
  ON :DependencySet(label) ONLINE (for uniqueness constraint)
  ON :Sense(uid)           ONLINE (for uniqueness constraint)
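These were set up as uniqueness constraints, roughly along these lines (a sketch; the original constraint statements aren't shown above):

CREATE CONSTRAINT ON (s:Sense) ASSERT s.uid IS UNIQUE;
CREATE CONSTRAINT ON (ds:DependencySet) ASSERT ds.label IS UNIQUE;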

Any explanations or thoughts on an alternative approach would be appreciated.

EDIT: The problem definitely seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 (the MATCH and CREATE clauses) from the Cypher query, it performs fine.

Upvotes: 4

Views: 1148

Answers (2)

Kenny Bastani

Reputation: 3308

I assume that you've already created all the Sense-labeled nodes before running this LOAD CSV import. What I think is going on is that as you pull nodes with the label Sense into memory and create relationships from the Sense node to the DependencySet via CREATE (s)-[:has]->(ds), you are increasing utilization of the available heap.

Another possibility is that the size of your relationship store in your memory-mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens, the relationship store for those nodes requires more memory. Eventually, when you hit 400k rows, the heap is maxed out. Up until that point it needs to do more garbage collection and read from disk.

Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
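(For reference, in 2.1 those memory-mapped store settings live in conf/neo4j.properties; the sizes below are placeholders you would tune to your store file sizes, along the lines described in that post:)

# conf/neo4j.properties -- placeholder sizes
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M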

That should resolve your problem. I don't see anything wrong with your query.

Upvotes: 4

ulkas

Reputation: 5918

I believe the line

MATCH (s:Sense {uid: toInt(row[4])})

is what drives the shape of the timing curve. Somewhere around the 200,000 mark on the x-axis of your graph, you no longer have all the Sense nodes in memory; some of them must be evicted to disk. From that point on, the extra time is simply the cost of re-loading data from disk into memory and vice versa (otherwise it would still be roughly linear, as it is while everything stays in memory).
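A quick way to sanity-check this (just a sketch; the uid value is taken from the sample CSV) is to profile the lookup on its own and confirm it uses the index on :Sense(uid) rather than a label scan:

PROFILE MATCH (s:Sense {uid: 73830}) RETURN s;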

Maybe if you could post your server memory settings, we could dig deeper.

For the Java heap space error, refer to Kenny's answer.

Upvotes: 1
