David
David

Reputation: 247

How to improve performance of two chained aggregations in neo4j query

Assuming I have a neo4j graph of the nature:

enter image description here

create (SK:Author {name:'Stephen King'}), (JK:Author {name:'J.K. Rowling'}), (DS:Author {name:'Dr. Seuss'}), (TS:Book {name:'The Stand'}), (HP:Book {name:'Harry Potter'}), (CH:Book {name:'Cat in the Hat'}), (SHINING:Book {name:'The Shining'}), (PAF:Genre {name:'Post-Apocalyptic fiction'}), (F:Genre {name:'Fantasy'}), (C:Genre {name:'Childrens'}), (HORROR:Genre {name:'Horror'}), (SK)<-[:WRITTEN_BY]-(TS)-[:CATEGORIZED_AS]->(PAF), (JK)<-[:WRITTEN_BY]-(HP)-[:CATEGORIZED_AS]->(F), (DS)<-[:WRITTEN_BY]-(CH)-[:CATEGORIZED_AS]->(C), (SK)<-[:WRITTEN_BY]-(SHINING)-[:CATEGORIZED_AS]->(HORROR)

Neo4j console link: http://console.neo4j.org/r/2d69kq

I have about 53,000 author nodes, 6 million book nodes, and 9,000 genre nodes.

For a query like:

match (b:Book)-[:WRITTEN_BY]->(a:Author)
where a.name in ['Stephen King', 'J.K. Rowling']
with a, collect(b) as bs
unwind bs as book
match (g:Genre)<-[r:CATEGORIZED_AS]-(book)
where id(g) in [13, 14, 15, 16]
with a, count(distinct book) as book_count_author, collect(book) as bs
unwind bs as book
match (g:Genre)<-[r:CATEGORIZED_AS]-(book)
where id(g) in [13, 14, 15, 16]
return a.name, g.name, count(distinct book) as book_count_genre, book_count_author

It takes roughly 12 seconds to complete. I've tried to rewrite the query a few different ways and use index hinting, but can't figure out any ways to make it any faster. Any thoughts? Obviously this example is simplified, but I do have indexes on the appropriate properties.

Here's a sample result from the chained aggregations: enter image description here

I need two aggregations. The first one is count of books by authors, constrained by the specified genres in the 2nd match. The second count is of books by each author in each genre, again constrained by the same genres.

Upvotes: 0

Views: 70

Answers (2)

Michael Hunger
Michael Hunger

Reputation: 41706

What @InverseFalcon said.

You can optimize it a bit further:

MATCH (g:Genre)<-[r:CATEGORIZED_AS]-(book:Book)-[:WRITTEN_BY]->(a:Author) 
WHERE a.name in ['Stephen King', 'J.K. Rowling'] 
  AND g.name IN ['Post-Apocalyptic fiction','Childrens','Horror','Fantasy'] 
WITH a, g, count(distinct book) as  book_count_author_genre 
RETURN a.name, collect({ genre: g.name, count: book_count_author_genre}), 
       sum(book_count_author_genre) as book_count_author

You might need to use index-hints for authors and genres

If you use id's for lookup it will be faster.

Upvotes: 0

InverseFalcon
InverseFalcon

Reputation: 30407

Assuming that you have an index (or existence constraint) on :Author.name, you can probably use this:

match (book:Book)-[:WRITTEN_BY]->(a:Author)
where a.name in ['Stephen King', 'J.K. Rowling']
with a, book
match (g:Genre)<-[:CATEGORIZED_AS]-(book)
return a, g, size((g)<-[:CATEGORIZED_AS]-()) as book_count_genre, count(book) as book_count_author

Collecting and unwinding can be expensive, try to avoid it where you can. Note that you can find the size of a pattern to get a count of the pattern that would otherwise not be retrievable by a count on a node given your columns.

EDIT

Modified query for clarified requirements

book_count_genre - the number of books with that genre by any author

book_count_author - the number of books with that genre written by the given author

Upvotes: 0

Related Questions