Jonathan
Jonathan

Reputation: 1936

Python: Finding connected components in a graph presented as edge lists

I have an edge list in the form

start | end
 a       c
 b       d
 e       b

I have tens of millions of edges (approx 30 million) and I'm not able to read in the entire graph into memory - at least not using a library like networkx which is a bit memory intensive.

My goal is to find all connected components in the graph represented by that list that contain less than x number of nodes.

For example, I want to get all connected components with less than x=30 nodes. But I don't want to do this by building the entire graph and then doing a search for connected components (e.g. calling this networkx command: nx.connected_component_subgraphs(nxg)).

Is there a way I can search for connected components just using the edge list file and without having to build the entire graph?

Additional info: The node names are strings of length 10-20 asci values.

Upvotes: 3

Views: 1774

Answers (1)

trincot
trincot

Reputation: 349956

You would first need to reduce the footprint of your data by assigning shorter identifiers for your nodes. You could write a mapping between those shorter identifiers and original names to another file, so you can translate any solution to those names after running an algorithm.

Assuming you have short enough identifiers for your nodes, you could load everything in memory, in a dictionary keyed by node identifiers.

Then use the Union-Find structure & algorithm to identify the connected components.

Finally filter those by the maximum size they are allowed to have.

There are some libraries out there which provide Union-Find implementations, which could provide better performance. Here is a simple implementation of Union-Find:

class Node:
    def __init__(self, key):
        self.key = key
        self.parent = self
        self.size = 1

class UnionFind(dict):
    def find(self, key):
        node = self.get(key, None)
        if node is None:
            node = self[key] = Node(key)
        else:
            while node.parent != node: 
                # walk up & perform path compression
                node.parent, node = node.parent.parent, node.parent
        return node

    def union(self, key_a, key_b):
        node_a = self.find(key_a)
        node_b = self.find(key_b)
        if node_a != node_b:  # disjoint? -> join!
            if node_a.size < node_b.size:
                node_a.parent = node_b
                node_b.size += node_a.size
            else:
                node_b.parent = node_a
                node_a.size += node_b.size

Then, the following function would load that structure from an iterator, and return the components whose sizes meet the requirement:

from collections import defaultdict

def find_components(line_iterator, max_size):
    forest = UnionFind()

    for line in line_iterator:
        forest.union(*line.split())

    result = defaultdict(list)
    for key in forest.keys():
        root = forest.find(key)
        if root.size <= max_size:
            result[root.key].append(key)

    return list(result.values())

Here is a demo for the following graph:

enter image description here

data = """x d
c j
i e
f x
n z
a u
g r
w x
p l
u o
m g
k s
t q
y l
h m
n b
k v
e u
i o
r m
n c
x q
f q
j l
s v"""

results = find_components(data.splitlines(), 5)
print(results)

The output for this demo is:

[['i', 'e', 'a', 'u', 'o'], ['g', 'r', 'm', 'h'], ['k', 's', 'v']]

Upvotes: 4

Related Questions