Connected components in a graph with 100 million nodes

Question

I am trying to get the list of connected components in a graph with 100 million nodes. For smaller graphs, I usually use the connected_components function of the Networkx module in Python which does exactly that. However, loading a graph with 100 million nodes (and their edges) into memory with this module would require ca. 110GB of memory, which I don't have. An alternative would be to use a graph database which has a connected components function but I haven't found any in Python. It would seem that Dex (API: Java, .NET, C++) has this functionality but I'm not 100% sure. Ideally I'm looking for a solution in Python. Many thanks.

Fred Foo · Accepted Answer

SciPy has a connected components algorithm. It expects as input the adjacency matrix of your graph in one of its sparse matrix formats and handles both the directed and undirected cases.

Building a sparse adjacency matrix from a sequence of (i, j) pairs adj_list where i and j are (zero-based) indices of nodes can be done with

i_indices, j_indices = zip(*adj_list)
adj_matrix = scipy.sparse.coo_matrix((np.ones(number_of_nodes),
                                     (i_indices, j_indices)))

You'll have to do some extra work for the undirected case.

This approach should be efficient if your graph is sparse enough.

Connected components in a graph with 100 million nodes

Answers (2)

Related Questions