Reputation: 63
I have data on movies that can be either comedies or dramas. I have actors in those movies, who can have multiple roles per movie. I want to find all distinct sets of movies and actors where:
(drama1:Movie {Genre:'Drama'})-[role1]-(actor1:Actor)-[role2]-(comedy:Movie {Genre:'Comedy'})-[role3]-(actor2:Actor)-[role4]-(drama2:Movie {Genre:'Drama'})
That is, I want to find where two (different) dramas are connected by a comedy with which both dramas share at least one actor. I'm struggling to do this efficiently and to get neo4j to give me distinct groups of drama1,drama2,actor1,actor2,comedy. My data is on the order of a few million nodes and tens of millions of relationships, so efficiency is important. A toy setup, which can be plugged into the neo4j online console is:
create (a:Movie {Genre:'Comedy'}), (b:Movie {Genre:'Comedy'}), (c:Movie {Genre:'Comedy'}), (d:Movie {Genre:'Comedy'}), (f:Movie {Genre:'Drama'}), (h:Movie {Genre:'Drama'}),(i:Actor {Name:'Sarah'}),(j:Actor {Name: 'Maria'}),(k:Actor {Name:'Mike'}),(l:Actor {Name:'Jane'}),(m:Actor {Name:'Sam'}),(q:Actor {Name:'Matt'}),(r:Actor {Name:'Tom'}), (i)-[:ActedIn]->(a), (i)-[:ActedIn]->(a) , (i)-[:ActedIn]->(a), (i)-[:ActedIn]->(a) , (i)-[:ActedIn]->(f) , (j)-[:ActedIn]->(b) , (j)-[:ActedIn]->(h) , (j)-[:ActedIn]->(h) , (q)-[:ActedIn]->(c) , (q)-[:ActedIn]->(b) , (q)-[:ActedIn]->(a) , (r)-[:ActedIn]->(f) , (r)-[:ActedIn]->(f) , (r)-[:ActedIn]->(a) , (j)-[:ActedIn]->(b) , (j)-[:ActedIn]->(c) , (k)-[:ActedIn]->(d), (l)-[:ActedIn]->(c) , (i)-[:ActedIn]->(a) , (i)-[:ActedIn]->(h) , (m)-[:ActedIn]->(h)
I've mostly tried variations of
match (drama1:Movie {Genre:'Drama'})-[role1]-(actor1:Actor)-[role2]-(comedy:Movie {Genre:'Comedy'})-[role3]-(actor2:Actor)-[role4]-(drama2:Movie {Genre:'Drama'}) return drama1,actor1,comedy,actor2,drama2
Upvotes: 0
Views: 43
Reputation: 108
You can use a multi match query like this:
MATCH p = (drama1:Movie {Genre:'Drama'})-[r1]-(a1)-[r2]-(comedy:Movie {Genre:'Comedy'})-[r3]-(a2)-[r4]-(drama2:Movie {Genre:'Drama'})
MATCH (drama1)-[]-(a:Actor)-[]-(drama2)
WHERE drama1<>drama2
RETURN nodes(p)
Query explanation:
There is no much to do to optimize the query because you are not filtering by any property, so you are going through the entire graph. You can try creating an index on 'Genre' property but it's not recomended if you only have two different values.
An other option is using SKIP/LIMIT to retrive patterns in smaller bulks:
MATCH p = (drama1:Movie {Genre:'Drama'})-[r1]-(a1)-[r2]-(comedy:Movie {Genre:'Comedy'})-[r3]-(a2)-[r4]-(drama2:Movie {Genre:'Drama'})
MATCH (drama1)-[]-(a:Actor)-[]-(drama2)
WHERE drama1<>drama2
RETURN nodes(p) SKIP 0 LIMIT 100000
This query gets the first 100k patterns and then you have to increment the skip to get the next 100k (SKIP 100000 LIMIT 100000). By this way you have to run 10 queries to get all patterns.
Upvotes: 0