user21143831
user21143831

Reputation: 35

Get ancestors of all nodes in a graph

I have a Directed Acyclic graph with 2 million nodes and a value attached to each node. For each node I want to take average of all ancestors. The maximum depth would be 5.

The ancestors method in networkx package gives all ancestors for a single node. I need to find the ancestors for all nodes in my graph. I can definitely loop over all nodes and use ancestors to get them for all nodes one by one. But since my graph has many nodes, I don't want to use for loop for that. I was wondering if there is any faster way to get the ancestors for multiple nodes at a same time.

This just an example of a graph if you want to work on:

import networkx as nx
DG = nx.DiGraph()
DG.add_edges_from([(1, 2), (1, 3), (3,4), (4, 5), (4, 6), (5, 6)])

This loop would work but it is not fast and efficient for the size of my graph:

ancestors = {node: nx.ancestors(DG, node) for node in DG.nodes}

I know the graph already has the information I want and the above solution does lots of redundant computation. So, I would appreciate it if you could give a right function to do this for me. I know the issue, I don't know the solution.

Upvotes: 0

Views: 621

Answers (1)

ravenspoint
ravenspoint

Reputation: 20616

There is very little point in storing somewhere the the ancestors of every node. This information would be redundant and require almost as much memory as the graph itself, depending on how deep your graph is.

For most uses, the information is already efficiently stored in the graph.

Here is some C++ code to average the values of a nodes ancestors to a depth of 5:

https://github.com/JamesBremner/PathFinder/blob/38cf5f449aa6ef09c157cdc26fa4490273f2d902/src/graphex.cpp#L121-L173

Timing results for running this on 1000 randomly selected nodes on graphs of different sizes:

graph node count nodes run total time
4,700 1000 2.5 secs
144,650 1000 613 secs

Linear ( !? ) extrapolation to a complete run of a 2 million node network gives an estimation of total expected run time : 18 hours

Extrapolation details:

Increasing node count factor 145,000 / 5,000 = 30 Increasing time factor 600 / 2.5 = 240

Ratio : 240 / 30 = 8

Increasing node count 2,000,000 / 150,000 = 13

8 * 13 * 600 secs = 64,000secs = 18 hours

Note: all calculations very approximate and assume the number of 5 deep ancestors per node is constant with respect to total node count ( scalable graph )

Upvotes: 0

Related Questions