GMc
GMc

Reputation: 1774

How do I calculate totals for the root nodes in a tree in neo4j?

I am learning cypher and was presented with a problem that I have actually solved, but wondering if there is a better way of writing the cypher query.

I have a hierarchy (tree) of arbitrary depth consisting of companies along with their subsidiaries and the subsidiaries of the subsidiaries and so on.

Each company / subsidiary is a node and an attribute on each node is the revenue earned by that particular company / subsidiary.

I want to calculate the total revenue for just the root nodes. That is I need to calculate the total revenue for the top level company being the sum of its own revenue plus the revenue of all the subsidiaries underneath it.

The query I have come up with calculates all of the sub totals for each mini-tree (a parent and its direct subsidiaries). The query appears to start at the bottom of the trees and works its way up.

The output of the first part of the query is a list of all of the nodes (except the leaves) with the total of all of the nodes under it.

Next I calculate all of the root nodes and "join" this list of root nodes to the previous result.

This returns the answer that I need. However, it seems quite convoluted - hence my question, is there a way to do this more elegantly - perhaps with just a single match clause?

Following is some sample data and the query I have written so far.

create (a:Company {revenue: 10, cid: "a"})
create (b:Company {revenue: 10, cid: "b"})
create (c:Company {revenue: 20, cid: "c"})
create (d:Company {revenue: 15, cid: "d"})
create (e:Company {revenue: 20, cid: "e"})
create (f:Company {revenue: 25, cid: "f"})
create (g:Company {revenue: 30, cid: "g"})
create (h:Company {revenue: 10, cid: "h"})
create (i:Company {revenue: 20, cid: "i"})
create (j:Company {revenue: 20, cid: "j"})
create (k:Company {revenue: 40, cid: "k"})
create (l:Company {revenue: 10, cid: "l"})
create (m:Company {revenue:  5, cid: "m"})

create (b)-[:REPORTS_TO]->(a)
create (c)-[:REPORTS_TO]->(a)
create (d)-[:REPORTS_TO]->(b)
create (e)-[:REPORTS_TO]->(c)
create (f)-[:REPORTS_TO]->(c)

create (h)-[:REPORTS_TO]->(g)
create (i)-[:REPORTS_TO]->(g)
create (j)-[:REPORTS_TO]->(h)
create (k)-[:REPORTS_TO]->(h)
create (l)-[:REPORTS_TO]->(i)
create (m)-[:REPORTS_TO]->(i)
;

Here is the query that I have created:

// First Calculate total revenue for each company in the tree with subsidiaries.
// This will include top level and intermediate level companies.
match (c: Company)<-[:REPORTS_TO*]-(s:Company)
  with c.cid as r_cid, sum (s.revenue) + c.revenue as tot_revenue

// Next, Determine the root nodes
// "join" the list of root nodes to the totals for each company.
// The result is the root node companies with their total revenues.
  match (c)
  where not ()<-[:REPORTS_TO]-(c) AND
      c.cid = r_cid
      // Return the root company id and the revenue for it.
  return c.cid, tot_revenue;

The above returns the result I am expecting which is:

+---------------------+
| c.cid | tot_revenue |
+---------------------+
| "g"   | 135         |
| "a"   | 100         |
+---------------------+

Again, this question is about whether or not there is a better way to write the cypher query than the solution I have come up with?

Upvotes: 1

Views: 129

Answers (1)

Rajendra Kadam
Rajendra Kadam

Reputation: 4052

Yes, There are some ways to make your Cypher query better.

Few things you are doing in your query that are not required or can be improved:

  1. Scan all the nodes second time and then filter in WHERE by matching cid of the current node with these nodes to get the node which you already have.

  2. Calculating total revenue for all the companies. You can avoid total revenue calculations for subsidiaries, as you are not using it anywhere.

To make queries run efficiently you need to minimize the total database calls(AKA db hits). You can check db hits for your by profiling the query. This will show you a query plan and which operators are doing most of the work. You need to run the query by adding PROFILE in the beginning.

I did profiling for your query. Total db hits for your query were 311.

Let's make changes to your query step by step:

Removing unnecessary comparisons: Total db hits reduced to 131

PROFILE 
MATCH (c:Company)<-[:REPORTS_TO*]-(s:Company)
WITH c, sum(s.revenue) + c.revenue AS tot_revenue
MATCH (c)
WHERE  NOT ()<-[:REPORTS_TO]-(c)
RETURN c.cid, tot_revenue;

Avoid calculating total revenue for subsidiaries by filtering root companies prior to calculation. Total db hits reduced to 108

PROFILE 
MATCH (c:Company)<-[:REPORTS_TO*]-(s:Company)
WHERE  NOT ()<-[:REPORTS_TO]-(c)
WITH c.cid AS r_cid, sum(s.revenue) + c.revenue AS tot_revenue
RETURN r_cid, tot_revenue;

Separating alias and addition on company revenue from aggregation. Total db hits reduced to 90

PROFILE 
MATCH (c:Company)<-[:REPORTS_TO*]-(s:Company)
WHERE  NOT ()<-[:REPORTS_TO]-(c)
WITH c, sum(s.revenue) AS sub_tot_revenue
RETURN c.cid AS cid, sub_tot_revenue + c.revenue AS tot_revenue;

These are some ways to improve your solution. You can read more about query tuning in Neo4j documentation.

Upvotes: 1

Related Questions