Reputation: 407

Neo4j crashes on 4th degree Cypher query

neo4j noob here, on Neo4j 2.0.0 Community

I've got a graph database of 24,000 movies and 2700 users, and somewhere around 60,000 LIKE relationships between a user and a movie.

Let's say that I've got a specific movie (movie1) in mind.

START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04") 
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)
RETURN usersLikingMovie1;

I can quickly and easily find the users who liked the movie with the above query. I can follow this path further to get the users who liked the same movies that as the people who liked movie1. I call these generation 2 users

START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04") 
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
RETURN usersGen2;

This query takes about 3 seconds and returns 1896 users.
Now I take this query one step further to get the movies liked by the users above (generation 2 movies)

START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04") 
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;

This query causes neo4j to spin for several minutes at 100% cpu utilization and using 4GB of RAM. Then it sends back an exception "OutOfMemoryError: GC overhead limit exceeded".

I was hoping someone could help me out and explain to me the issue. Is Neo4j not meant to handle a query of this depth in a performant manner? Is there something wrong with my Cypher query?

Thanks for taking the time to read.

Upvotes: 0

Answers (3)

Huston

Reputation: 83

Cypher is a pattern matching language, and it is important to remember that the MATCH clause will always find a pattern everywhere it exists in the Graph.

The problem with the MATCH clause you are using is that sometimes Cypher will find different patterns where 'usersGen2' is the same as 'usersLikingMovie1' and where 'movie1' is the same as 'movieGen1' across different patterns. So, in essence, Cypher finds the pattern every single time it exists in the Graph, is holding it in memory for the duration of the query, and then returning all the moviesGen2 nodes, which could actually be the same node n number of times.

MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)

If you explicitly tell Cypher that the movies and users should be different for each match pattern it should solve the issue. Try this? Additionally, The DISTINCT parameter will make sure you only grab each 'moviesGen2' node once.

START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04") 
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
WHERE movie1 <> moviesGen2 AND usersLikingMovie1 <> usersGen2
RETURN DISTINCT moviesGen2;

Additionally, in 2.0, the start clause is not required. So you can actually leave out the START clause all together (However - only if you are NOT using a legacy index and use labels)...

Hope this works... Please correct my answer if there are syntax errors...

Upvotes: 1

Michael Hunger

Reputation: 41676

To minimize the explosion that @cod3monk3y talks about, I'd limit the number of intermediate results.

START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04") 
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;

or even like this

START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04") 
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)
WITH distinct usersGen2
MATCH (usersGen2)-[:LIKES]->(moviesGen2)
RETURN distinct moviesGen2;

if you want to, you can use "profile start ..." in the neo4j shell to see how many hits / db-rows you create in between, starting with your query and then these two.

Upvotes: 1

cod3monk3y

Reputation: 9843

That's a pretty intense query, and the deeper you go the closer you're probably getting to a set of all users that ever rated any movie, since you're essentially just expanding out through the graph in tree form starting with your given movie. @Huston's WHERE and DISTINCT clauses will help to prune branches you've already seen, but you're still just expanding out through the tree.

The branching factor of your tree can be estimated with two values:

u, the average number of users that liked a movie (incoming to :Movie)
m, the average number of movies that each user liked (outgoing from :User)

For an estimate, your first step will return m users. On the next step, for each user you get all the movies each of them liked followed by all the users that liked all of those movies:

gen(1) => u
gen(2) => u * (m * u)

For each generation you'll tack on another m*u, so your third generation is:

gen(3) => u * (m * u) * (m * u)

Or more generically:

gen(n) => u^n * m^(n-1)

You could estimate your branching factors by computing the average of your likes/users and likes/movie, but that's probably very inaccurate since it gives you 22.2 likes/user and 2.5 likes/movie. Those numbers aren't reasonable for any movie that's worthy of rating. A better approach would be to take the median number of ratings or look at a histogram and use the peaks as your branching factors.

To put this in perspective, the average Netflix user rated 200 movies. The Netflix Prize training set had 17,770 movies, 480,189 users, and 100,480,507 ratings. That's 209 ratings/user and 5654 ratings/movie.

To keep things simple (and assuming your data set is much smaller), let's use:

m = 20 movie ratings/user 
u = 100 users have rated/movie

Your query in gen-3 (without distincts) will return:

gen(3) = 100^3 * 20^2
       = 400,000,000

400 million nodes (users)

Since you only have 2700 users, I think it's safe to say your query probably returns every user in your data set (rather, 148 thousand-ish copies of each user).

Your movie nodes in ASCII -- (n:Movie {movieid:"88cacfca-3def-4b2c-acb2-8e7f4f28be04"}) are 58 bytes minimum. If your users are about the same, let's say each node is 60 bytes, your storage requirement for this resultant set is:

400,000,000 nodes * 60 bytes
= 24,000,000,000 bytes
= 23,437,500 kb
= 22,888 Mb
= 22.35 Gb

So by my conservative estimates, your query requires 22 Gigabytes of storage. This seems quite reasonable that Neo4j would run out of memory.

My guess is that you're trying to find similarities in the patterns of users, but the query you're using is returning all the users in your dataset duplicated a bunch of times. Maybe you want to be asking questions of your data more like:

what users rate movies most like me?
what users rated most of the same movies as I rated
what movies have users that have rated similar movies to me watched that I haven't watched yet?

Cheers, cm

Upvotes: 1

Neo4j crashes on 4th degree Cypher query

Answers (3)

Related Questions