Reputation: 4588
I am trying to get the list of connected components in a graph with 100 million nodes. For smaller graphs, I usually use the connected_components function of the Networkx module in Python which does exactly that. However, loading a graph with 100 million nodes (and their edges) into memory with this module would require ca. 110GB of memory, which I don't have. An alternative would be to use a graph database which has a connected components function but I haven't found any in Python. It would seem that Dex (API: Java, .NET, C++) has this functionality but I'm not 100% sure. Ideally I'm looking for a solution in Python. Many thanks.
Upvotes: 4
Views: 7328
Reputation: 1829
https://graph-tool.skewed.de/performance
this tool as you can see from performance is very fast. It's written in C++ but the interface is in Python.
If this tool isn't good enough for you. (Which I think it will) then you can try Apache Giraph (http://giraph.apache.org/).
Upvotes: 3
Reputation: 363597
SciPy has a connected components algorithm. It expects as input the adjacency matrix of your graph in one of its sparse matrix formats and handles both the directed and undirected cases.
Building a sparse adjacency matrix from a sequence of (i, j)
pairs adj_list
where i
and j
are (zero-based) indices of nodes can be done with
i_indices, j_indices = zip(*adj_list)
adj_matrix = scipy.sparse.coo_matrix((np.ones(number_of_nodes),
(i_indices, j_indices)))
You'll have to do some extra work for the undirected case.
This approach should be efficient if your graph is sparse enough.
Upvotes: 7