David M.
David M.

Reputation: 4588

Connected components in a graph with 100 million nodes

I am trying to get the list of connected components in a graph with 100 million nodes. For smaller graphs, I usually use the connected_components function of the Networkx module in Python which does exactly that. However, loading a graph with 100 million nodes (and their edges) into memory with this module would require ca. 110GB of memory, which I don't have. An alternative would be to use a graph database which has a connected components function but I haven't found any in Python. It would seem that Dex (API: Java, .NET, C++) has this functionality but I'm not 100% sure. Ideally I'm looking for a solution in Python. Many thanks.

Upvotes: 4

Views: 7328

Answers (2)

george
george

Reputation: 1829

https://graph-tool.skewed.de/performance

this tool as you can see from performance is very fast. It's written in C++ but the interface is in Python.

If this tool isn't good enough for you. (Which I think it will) then you can try Apache Giraph (http://giraph.apache.org/).

Upvotes: 3

Fred Foo
Fred Foo

Reputation: 363597

SciPy has a connected components algorithm. It expects as input the adjacency matrix of your graph in one of its sparse matrix formats and handles both the directed and undirected cases.

Building a sparse adjacency matrix from a sequence of (i, j) pairs adj_list where i and j are (zero-based) indices of nodes can be done with

i_indices, j_indices = zip(*adj_list)
adj_matrix = scipy.sparse.coo_matrix((np.ones(number_of_nodes),
                                     (i_indices, j_indices)))

You'll have to do some extra work for the undirected case.

This approach should be efficient if your graph is sparse enough.

Upvotes: 7

Related Questions