Reputation: 909
The above graph was drawn using LaTeX: https://www.overleaf.com/read/rxhpghzbkhby
The above graph is represented as a dictionary in Python.
graph = {
'A' : ['B','D', 'C'],
'B' : ['C'],
'C' : [],
'D' : ['E'],
'E' : ['G'],
'F' : ['A', 'I'],
'G' : ['A', 'K'],
'H' : ['F', 'G'],
'I' : ['H'],
'J' : ['A'],
'K' : []
}
I have a large graph of about 3,378,546 nodes.
Given the above-directed graph, I am trying to find circles with at least 3 and less than 5 different nodes, and output the first 3 circles.
I spent 1 day and a half on this problem. I looked in Stackoverflow and even tried to follow this Detect Cycle in a Directed Graph tutorial but couldn't come up with a solution.
In this example, the output is a tab-delimited text file where each line has a cycle.
0 A, D, E, G
1 F, I, H
0
and 1
are indexes.
Also, there is no order in the alphabet of the graph nodes.
I tried this form How to implement depth-first search in Python tutorial:
visited = set()
def dfs(visited, graph, node):
if node not in visited:
print (node)
visited.add(node)
for neighbour in graph[node]:
dfs(visited, graph, neighbour)
dfs(visited, graph, 'A')
But this doesn't help. I also tried this Post
Upvotes: 1
Views: 1321
Reputation: 909
NOTE: This solution is an extended solution to the describe one. I extended to the original graph with ~3million nodes and I look for all cycles that are at least 3 nodes and less than 40 nodes and store the first 3 cycles into a file.
I came up with the following solution.
# implementation of Johnson's cycle finding algorithm
# Original paper: Donald B Johnson. "Finding all the elementary circuits of a directed graph." SIAM Journal on Computing. 1975.
from collections import defaultdict
import networkx as nx
from networkx.utils import not_implemented_for, pairwise
@not_implemented_for("undirected")
def findCycles(G):
"""Find simple cycles of a directed graph.
A `simple cycle` is a closed path where no node appears twice.
Two elementary circuits are distinct if they are not cyclic permutations of each other.
This is iterator/generator version of Johnson's algorithm [1]_.
There may be better algorithms for some cases [2]_ [3]_.
Parameters
----------
G : NetworkX DiGraph
A directed graph
Returns
-------
cycle_generator: generator
A generator that produces elementary cycles of the graph.
Each cycle is represented by a list of nodes along the cycle.
Examples
--------
>>> graph = {'A' : ['B','D', 'C'],
'B' : ['C'],
'C' : [],
'D' : ['E'],
'E' : ['G'],
'F' : ['A', 'I'],
'G' : ['A', 'K'],
'H' : ['F', 'G'],
'I' : ['H'],
'J' : ['A'],
'K' : []
}
>>> G = nx.DiGraph()
>>> G.add_nodes_from(graph.keys())
>>> for keys, values in graph.items():
G.add_edges_from(([(keys, node) for node in values]))
>>> list(nx.findCycles(G))
[['F', 'I', 'H'], ['G', 'A', 'D', 'E']]
Notes
-----
The implementation follows pp. 79-80 in [1]_.
The time complexity is $O((n+e)(c+1))$ for $n$ nodes, $e$ edges and $c$
elementary circuits.
References
----------
.. [1] Finding all the elementary circuits of a directed graph.
D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975.
https://doi.org/10.1137/0204007
.. [2] Enumerating the cycles of a digraph: a new preprocessing strategy.
G. Loizou and P. Thanish, Information Sciences, v. 27, 163-182, 1982.
.. [3] A search strategy for the elementary cycles of a directed graph.
J.L. Szwarcfiter and P.E. Lauer, BIT NUMERICAL MATHEMATICS,
v. 16, no. 2, 192-204, 1976.
--------
"""
def _unblock(thisnode, blocked, B):
stack = {thisnode}
while stack:
node = stack.pop()
if node in blocked:
blocked.remove(node)
stack.update(B[node])
B[node].clear()
# Johnson's algorithm requires some ordering of the nodes.
# We assign the arbitrary ordering given by the strongly connected comps
# There is no need to track the ordering as each node removed as processed.
# Also we save the actual graph so we can mutate it. We only take the
# edges because we do not want to copy edge and node attributes here.
subG = type(G)(G.edges())
sccs = [scc for scc in nx.strongly_connected_components(subG) if len(scc) in list(range(3, 41))]
# Johnson's algorithm exclude self cycle edges like (v, v)
# To be backward compatible, we record those cycles in advance
# and then remove from subG
for v in subG:
if subG.has_edge(v, v):
yield [v]
subG.remove_edge(v, v)
while sccs:
scc = sccs.pop()
sccG = subG.subgraph(scc)
# order of scc determines ordering of nodes
startnode = scc.pop()
# Processing node runs "circuit" routine from recursive version
path = [startnode]
blocked = set() # vertex: blocked from search?
closed = set() # nodes involved in a cycle
blocked.add(startnode)
B = defaultdict(set) # graph portions that yield no elementary circuit
stack = [(startnode, list(sccG[startnode]))] # sccG gives comp nbrs
while stack:
thisnode, nbrs = stack[-1]
if nbrs:
nextnode = nbrs.pop()
if nextnode == startnode:
yield path[:]
closed.update(path)
# print "Found a cycle", path, closed
elif nextnode not in blocked:
path.append(nextnode)
stack.append((nextnode, list(sccG[nextnode])))
closed.discard(nextnode)
blocked.add(nextnode)
continue
# done with nextnode... look for more neighbors
if not nbrs: # no more nbrs
if thisnode in closed:
_unblock(thisnode, blocked, B)
else:
for nbr in sccG[thisnode]:
if thisnode not in B[nbr]:
B[nbr].add(thisnode)
stack.pop()
path.pop()
# done processing this node
H = subG.subgraph(scc) # make smaller to avoid work in SCC routine
sccs.extend(scc for scc in nx.strongly_connected_components(H) if len(scc) in list(range(3, 41)))
import sys, csv, json
def findAllCycles(jsonInputFile, textOutFile):
"""Find simple cycles of a directed graph (jsonInputFile).
Parameters:
----------
jsonInputFile: a json file that has all concepts
textOutFile: give a desired name of output file
Returns:
----------
a .text file (named: {textOutFile}.txt) has the first 3 cycles found in jsonInputFile
Each cycle is represented by a list of nodes along the cycle
"""
with open(jsonInputFile) as infile:
graph = json.load(infile)
# Convert the json file to a NetworkX directed graph
G = nx.DiGraph()
G.add_nodes_from(graph.keys())
for keys, values in graph.items():
G.add_edges_from(([(keys, node) for node in values]))
# Search for all simple cycles existed in the graph
_cycles = list(findCycles(G))
# Start with an empty list and populate it by looping over all cycles
# in _cycles that have at least 3 and less than 40 different concepts (nodes)
cycles = []
for cycle in _cycles:
if len(cycle) in list(range(3, 41)):
cycles.append(cycle)
# Store the cycels under constraint in {textOutFile}.txt
with open(textOutFile, 'w') as outfile:
for cycle in cycles[:3]:
outfile.write(','.join(n for n in cycle)+'\n')
outfile.close()
# When process finishes, print Done!!
return 'Done!!'
infile = sys.argv[1]
outfile = sys.argv[2]
first_cycles = findAllCycles(infile, outfile)
To run this program, you simply use a command line as follows:
>> python3 {program file name}.py graph.json {desired output file name}[.txt][.csv]
let, for example {desired output file name}}.[txt][.csv], be first_3_cycles_found.txt
In my case, the graph has 3,378,546 nodes which took me ~40min to find all cycles using the above code. Thus, the output file will be:
Please contribute to this if you see it needs any improvement or something else to be added.
Upvotes: 1
Reputation: 140
Here is a commented code that would print the array containing the cycles found. Not much more would be necessary I believe to adjust the return value to the desired format (CSV in your case I think).
It could be that with 3M nodes, this turns out to be slow. I would then suggest going the dynamic programming way and caching/memoize the results of some recursions in order not to repeat them.
I hope this solves your problem or at least helps.
def cycles_rec(root, current_node, graph, depth, visited, min_depth, max_depth):
depth += 1
# First part our stop conditions
if current_node in visited or current_node not in graph.keys():
return ''
if depth >= max_depth:
return ''
visited.append(current_node)
if root in graph[current_node] and depth >= min_depth:
return current_node
# The recursive part
# for each connection we try to find recursively one that would cycle back to our root
for connections in graph[current_node]:
for connection in connections:
result = cycles_rec(root, connection, graph, depth, visited, min_depth, max_depth)
# If a match was found, it would "bubble up" here, we can return it along with the
# current connection that "found it"
if result != '':
return current_node + ' ' + result
# If we are here we found no cycle
return ''
def cycles(graph, min_depth = 3, max_depth = 5):
cycles = {}
for node, connections in graph.items():
for connection in connections:
visited = []
# Let the recursion begin here
result = cycles_rec(node, connection, graph, 1, visited, min_depth, max_depth)
if result == '':
continue
# Here we found a cycle.
# Fingerprint is only necessary in order to not repeat the cycles found in the results
# It could be ignored if repeating them is not important
# It's based on the fact that nodes are all represented as letters here
# It could be it's own function returning a hash for example if nodes have a more
# complex representation
fingerprint = ''.join(sorted(list(node + ' ' + result)))
if fingerprint not in cycles.keys():
cycles[fingerprint] = node + ' ' + result
return list(cycles.values())
So, assuming the graph variable you declared in your example:
print(cycles(graph, 3, 5))
Would print out
['A D E G', 'F I H']
Upvotes: 1