Joe
Joe

Reputation: 909

Find all cycles with at least 3 nodes in a directed graph using dictionary data structure

Directed Graph

The above graph was drawn using LaTeX: https://www.overleaf.com/read/rxhpghzbkhby

The above graph is represented as a dictionary in Python.

graph = {
    'A' : ['B','D', 'C'],
    'B' : ['C'],
    'C' : [],
    'D' : ['E'],
    'E' : ['G'],
    'F' : ['A', 'I'],
    'G' : ['A', 'K'],
    'H' : ['F', 'G'],
    'I' : ['H'],
    'J' : ['A'],
    'K' : []
}

I have a large graph of about 3,378,546 nodes.

Given the above-directed graph, I am trying to find circles with at least 3 and less than 5 different nodes, and output the first 3 circles.

I spent 1 day and a half on this problem. I looked in Stackoverflow and even tried to follow this Detect Cycle in a Directed Graph tutorial but couldn't come up with a solution.

In this example, the output is a tab-delimited text file where each line has a cycle.

0 A, D, E, G
1 F, I, H

0 and 1 are indexes. Also, there is no order in the alphabet of the graph nodes.

I tried this form How to implement depth-first search in Python tutorial:

visited = set()

def dfs(visited, graph, node):
    if node not in visited:
        print (node)
        visited.add(node)
        for neighbour in graph[node]:
            dfs(visited, graph, neighbour)

dfs(visited, graph, 'A')

But this doesn't help. I also tried this Post

Upvotes: 1

Views: 1321

Answers (2)

Joe
Joe

Reputation: 909

NOTE: This solution is an extended solution to the describe one. I extended to the original graph with ~3million nodes and I look for all cycles that are at least 3 nodes and less than 40 nodes and store the first 3 cycles into a file.


I came up with the following solution.

# implementation of Johnson's cycle finding algorithm
# Original paper: Donald B Johnson. "Finding all the elementary circuits of a directed graph." SIAM Journal on Computing. 1975.

from collections import defaultdict

import networkx as nx
from networkx.utils import not_implemented_for, pairwise

@not_implemented_for("undirected")
def findCycles(G):
    """Find simple cycles of a directed graph.
    A `simple cycle` is a closed path where no node appears twice.
    Two elementary circuits are distinct if they are not cyclic permutations of each other.
    This is iterator/generator version of Johnson's algorithm [1]_.
    There may be better algorithms for some cases [2]_ [3]_.
    Parameters
    ----------
    G : NetworkX DiGraph
       A directed graph
    Returns
    -------
    cycle_generator: generator
       A generator that produces elementary cycles of the graph.
       Each cycle is represented by a list of nodes along the cycle.
    Examples
    --------
    >>> graph = {'A' : ['B','D', 'C'],
                 'B' : ['C'],
                 'C' : [],
                 'D' : ['E'],
                 'E' : ['G'],
                 'F' : ['A', 'I'],
                 'G' : ['A', 'K'],
                 'H' : ['F', 'G'],
                 'I' : ['H'],
                 'J' : ['A'],
                 'K' : []
                }
    >>> G = nx.DiGraph()
    >>> G.add_nodes_from(graph.keys())
    >>> for keys, values in graph.items():
            G.add_edges_from(([(keys, node) for node in values]))
    >>> list(nx.findCycles(G))
    [['F', 'I', 'H'], ['G', 'A', 'D', 'E']]
    Notes
    -----
    The implementation follows pp. 79-80 in [1]_.
    The time complexity is $O((n+e)(c+1))$ for $n$ nodes, $e$ edges and $c$
    elementary circuits.
    References
    ----------
    .. [1] Finding all the elementary circuits of a directed graph.
       D. B. Johnson, SIAM Journal on Computing 4, no. 1, 77-84, 1975.
       https://doi.org/10.1137/0204007
    .. [2] Enumerating the cycles of a digraph: a new preprocessing strategy.
       G. Loizou and P. Thanish, Information Sciences, v. 27, 163-182, 1982.
    .. [3] A search strategy for the elementary cycles of a directed graph.
       J.L. Szwarcfiter and P.E. Lauer, BIT NUMERICAL MATHEMATICS,
       v. 16, no. 2, 192-204, 1976.
    --------
    """

    def _unblock(thisnode, blocked, B):
        stack = {thisnode}
        while stack:
            node = stack.pop()
            if node in blocked:
                blocked.remove(node)
                stack.update(B[node])
                B[node].clear()

    # Johnson's algorithm requires some ordering of the nodes.
    # We assign the arbitrary ordering given by the strongly connected comps
    # There is no need to track the ordering as each node removed as processed.
    # Also we save the actual graph so we can mutate it. We only take the
    # edges because we do not want to copy edge and node attributes here.
    subG = type(G)(G.edges())
    sccs = [scc for scc in nx.strongly_connected_components(subG) if len(scc) in list(range(3, 41))]

    # Johnson's algorithm exclude self cycle edges like (v, v)
    # To be backward compatible, we record those cycles in advance
    # and then remove from subG
    for v in subG:
        if subG.has_edge(v, v):
            yield [v]
            subG.remove_edge(v, v)

    while sccs:
        scc = sccs.pop()
        sccG = subG.subgraph(scc)
        # order of scc determines ordering of nodes
        startnode = scc.pop()
        # Processing node runs "circuit" routine from recursive version
        path = [startnode]
        blocked = set()  # vertex: blocked from search?
        closed = set()  # nodes involved in a cycle
        blocked.add(startnode)
        B = defaultdict(set)  # graph portions that yield no elementary circuit
        stack = [(startnode, list(sccG[startnode]))]  # sccG gives comp nbrs
        while stack:
            thisnode, nbrs = stack[-1]
            if nbrs:
                nextnode = nbrs.pop()
                if nextnode == startnode:
                    yield path[:]
                    closed.update(path)
                #                        print "Found a cycle", path, closed
                elif nextnode not in blocked:
                    path.append(nextnode)
                    stack.append((nextnode, list(sccG[nextnode])))
                    closed.discard(nextnode)
                    blocked.add(nextnode)
                    continue
            # done with nextnode... look for more neighbors
            if not nbrs:  # no more nbrs
                if thisnode in closed:
                    _unblock(thisnode, blocked, B)
                else:
                    for nbr in sccG[thisnode]:
                        if thisnode not in B[nbr]:
                            B[nbr].add(thisnode)
                stack.pop()
                path.pop()
        # done processing this node
        H = subG.subgraph(scc)  # make smaller to avoid work in SCC routine
        sccs.extend(scc for scc in nx.strongly_connected_components(H) if len(scc) in list(range(3, 41)))
import sys, csv, json

def findAllCycles(jsonInputFile, textOutFile):
    """Find simple cycles of a directed graph (jsonInputFile).
    Parameters:
    ----------
        jsonInputFile: a json file that has all concepts
        textOutFile: give a desired name of output file
    Returns:
    ----------
        a .text file (named: {textOutFile}.txt) has the first 3 cycles found in jsonInputFile
        Each cycle is represented by a list of nodes along the cycle
    """

    with open(jsonInputFile) as infile:
        graph = json.load(infile)
    # Convert the json file to a NetworkX directed graph
    G = nx.DiGraph()
    G.add_nodes_from(graph.keys())
    for keys, values in graph.items():
        G.add_edges_from(([(keys, node) for node in values]))
    # Search for all simple cycles existed in the graph
    _cycles = list(findCycles(G))
    # Start with an empty list and populate it by looping over all cycles
    # in _cycles that have at least 3 and less than 40 different concepts (nodes)
    cycles = []
    for cycle in _cycles:
        if len(cycle) in list(range(3, 41)):
            cycles.append(cycle)
    # Store the cycels under constraint in {textOutFile}.txt
    with open(textOutFile, 'w') as outfile:
        for cycle in cycles[:3]:
            outfile.write(','.join(n for n in cycle)+'\n')
        outfile.close()
    # When process finishes, print Done!!
    return 'Done!!'


infile = sys.argv[1]
outfile = sys.argv[2]
first_cycles = findAllCycles(infile, outfile)

To run this program, you simply use a command line as follows:

>> python3 {program file name}.py graph.json {desired output file name}[.txt][.csv]

let, for example {desired output file name}}.[txt][.csv], be first_3_cycles_found.txt

In my case, the graph has 3,378,546 nodes which took me ~40min to find all cycles using the above code. Thus, the output file will be:

enter image description here

Please contribute to this if you see it needs any improvement or something else to be added.

Upvotes: 1

m.pons
m.pons

Reputation: 140

Here is a commented code that would print the array containing the cycles found. Not much more would be necessary I believe to adjust the return value to the desired format (CSV in your case I think).

It could be that with 3M nodes, this turns out to be slow. I would then suggest going the dynamic programming way and caching/memoize the results of some recursions in order not to repeat them.

I hope this solves your problem or at least helps.

def cycles_rec(root, current_node, graph, depth, visited, min_depth, max_depth):
    depth += 1

    # First part our stop conditions
    if current_node in visited or current_node not in graph.keys():
        return ''

    if depth >= max_depth:
        return ''

    visited.append(current_node)

    if root in graph[current_node] and depth >= min_depth:
        return current_node

    # The recursive part
    # for each connection we try to find recursively one that would cycle back to our root
    for connections in graph[current_node]:
        for connection in connections:
            result = cycles_rec(root, connection, graph, depth, visited, min_depth, max_depth)
            # If a match was found, it would "bubble up" here, we can return it along with the
            # current connection that "found it"
            if result != '':
                return current_node + ' ' + result

    # If we are here we found no cycle        
    return ''

def cycles(graph, min_depth = 3, max_depth = 5):
    cycles = {}
    for node, connections in graph.items():
        for connection in connections:
            visited = []
            # Let the recursion begin here
            result = cycles_rec(node, connection, graph, 1, visited, min_depth, max_depth)
            if result == '':
                continue 
            # Here we found a cycle.
            # Fingerprint is only necessary in order to not repeat the cycles found in the results
            # It could be ignored if repeating them is not important
            # It's based on the fact that nodes are all represented as letters here
            # It could be it's own function returning a hash for example if nodes have a more
            # complex representation
            fingerprint = ''.join(sorted(list(node + ' ' + result)))
            if fingerprint not in cycles.keys():
                cycles[fingerprint] = node + ' ' + result

    return list(cycles.values())

So, assuming the graph variable you declared in your example:

print(cycles(graph, 3, 5))

Would print out

['A D E G', 'F I H']

Upvotes: 1

Related Questions