Reputation: 7380
I have Neo4j installed on a Windows machine with 12 processors and 64GB of RAM. I did not change any of the memory settings that Neo4j exposes.
My database has 3.8M nodes, 210,000 of which are labeled Geotagged, and a total of 650,000 relationships. I am trying to run the following query and I am wondering whether it is such an intensive query that it will simply take a long time.
Messages.csv is my relationship file. The relationships have already been created, but since I could not figure out how to combine the relationship creation with the Distance generation below, I am loading and running through the relationship file twice.
USING PERIODIC COMMIT 15000
LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
MATCH (a:Geotagged { username: line.sender })-[r:MSGED]->(b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));
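For reference, the SET expression is just the standard haversine great-circle distance, with R = 6371 km and all angles in radians:
d = 2 * R * asin(sqrt(haversin(latB - latA) + cos(latA) * cos(latB) * haversin(lonB - lonA)))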
The initial relationship generation takes about 3-5 minutes. I let the above run for over an hour and it still was not complete. I ran a similar algorithm (though it had a few more trig calls in it) on the same initial db and let it run for over 18 hours and still had not completed.
My question: Is this a very intensive query? Am I not giving it enough time? And more importantly, is there a way I can optimize this?
I tried adding "WHERE NOT HAS(r.Distance)" to exclude relationships on which the algorithm has already set Distance, though I am unsure whether the MATCH is a one-time match or whether it runs again for each line in the CSV file.
Any thoughts on this would really be appreciated.
Upvotes: 2
Views: 211
Reputation: 39905
This is in addition to Brian's reply:
Your statement's query plan contains an Eager operator. To verify, run:
EXPLAIN LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 100
MATCH (a:Geotagged { username: line.sender })-[r:MSGED]->(b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));
Eagerness in LOAD CSV is pretty bad: the Eager operator forces all rows to be pulled in before any writes happen, which effectively disables USING PERIODIC COMMIT and makes memory usage grow with the size of the file. There are blog posts on avoiding Eager that explain this in detail.
Following Mark's suggestion and replacing the MATCH/SET with a MERGE plus ON MATCH SET, we can refactor that into:
EXPLAIN LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 100
MATCH (a:Geotagged { username: line.sender }), (b:Geotagged { username: line.recipient })
MERGE (a)-[r:MSGED]->(b)
ON MATCH SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));
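Because the MSGED relationships were already created in the first pass, the MERGE only matches the existing relationships and ON MATCH SET writes the distance onto them. Note that for any row whose relationship does not yet exist, MERGE would create it without a Distance property, since ON MATCH SET does not fire on creation.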
And the Eager operator has vanished.
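Once EXPLAIN confirms that, the full run is the same statement with the LIMIT removed and the periodic commit restored (a sketch, untested on your data):
USING PERIODIC COMMIT 15000
LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
MATCH (a:Geotagged { username: line.sender }), (b:Geotagged { username: line.recipient })
MERGE (a)-[r:MSGED]->(b)
ON MATCH SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));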
Upvotes: 2
Reputation: 10856
One way that I would start to debug is to put a limit on it using WITH:
USING PERIODIC COMMIT 15000
LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 100
MATCH (a:Geotagged { username: line.sender })-[r:MSGED]->(b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));
With that you can change the LIMIT number to see how performance degrades as the limit increases.
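For example, stepping the sample up and profiling it (a sketch; PROFILE actually executes the statement and reports db hits per operator, and the periodic commit is omitted since a small sample does not need it):
PROFILE LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 1000
MATCH (a:Geotagged { username: line.sender })-[r:MSGED]->(b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));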
Also, is the username property indexed for the Geotagged label? If not, it definitely should be, like this:
CREATE INDEX ON :Geotagged(username)
If it's unique and you want the database to enforce that:
CREATE CONSTRAINT ON (g:Geotagged) ASSERT g.username IS UNIQUE
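Note that the uniqueness constraint is backed by an index, so you only need one of the two for username, not both.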
Upvotes: 2