Reputation: 1936
I have an edge list in the form
start | end
a c
b d
e b
I have tens of millions of edges (approx 30 million) and I'm not able to read in the entire graph into memory - at least not using a library like networkx which is a bit memory intensive.
My goal is to find all connected components in the graph represented by that list that contain less than x
number of nodes.
For example, I want to get all connected components with less than x=30
nodes. But I don't want to do this by building the entire graph and then doing a search for connected components (e.g. calling this networkx command: nx.connected_component_subgraphs(nxg)
).
Is there a way I can search for connected components just using the edge list file and without having to build the entire graph?
Additional info: The node names are strings of length 10-20 asci values.
Upvotes: 3
Views: 1774
Reputation: 349956
You would first need to reduce the footprint of your data by assigning shorter identifiers for your nodes. You could write a mapping between those shorter identifiers and original names to another file, so you can translate any solution to those names after running an algorithm.
Assuming you have short enough identifiers for your nodes, you could load everything in memory, in a dictionary keyed by node identifiers.
Then use the Union-Find structure & algorithm to identify the connected components.
Finally filter those by the maximum size they are allowed to have.
There are some libraries out there which provide Union-Find implementations, which could provide better performance. Here is a simple implementation of Union-Find:
class Node:
def __init__(self, key):
self.key = key
self.parent = self
self.size = 1
class UnionFind(dict):
def find(self, key):
node = self.get(key, None)
if node is None:
node = self[key] = Node(key)
else:
while node.parent != node:
# walk up & perform path compression
node.parent, node = node.parent.parent, node.parent
return node
def union(self, key_a, key_b):
node_a = self.find(key_a)
node_b = self.find(key_b)
if node_a != node_b: # disjoint? -> join!
if node_a.size < node_b.size:
node_a.parent = node_b
node_b.size += node_a.size
else:
node_b.parent = node_a
node_a.size += node_b.size
Then, the following function would load that structure from an iterator, and return the components whose sizes meet the requirement:
from collections import defaultdict
def find_components(line_iterator, max_size):
forest = UnionFind()
for line in line_iterator:
forest.union(*line.split())
result = defaultdict(list)
for key in forest.keys():
root = forest.find(key)
if root.size <= max_size:
result[root.key].append(key)
return list(result.values())
Here is a demo for the following graph:
data = """x d
c j
i e
f x
n z
a u
g r
w x
p l
u o
m g
k s
t q
y l
h m
n b
k v
e u
i o
r m
n c
x q
f q
j l
s v"""
results = find_components(data.splitlines(), 5)
print(results)
The output for this demo is:
[['i', 'e', 'a', 'u', 'o'], ['g', 'r', 'm', 'h'], ['k', 's', 'v']]
Upvotes: 4