Reputation: 246
I have a data set that is a csv/txt file representing a network. Each line in the file contains two node names separated by a comma. My data file contains about 330k nodes and about 550k edges. I am trying to create just a very rudimentary graph of this (yes, I know it will be very cluttered) using the following code:
import networkx as nx
import matplotlib.pyplot as plt
import sys
import numpy as np
f = open('dataFile.txt', 'rb')
G = nx.read_edgelist(f, delimiter=',', nodetype=str)
f.close()
print(nx.number_of_nodes(G))
print(nx.number_of_edges(G))
plt.figure(1)
nx.draw(G)
plt.savefig("graph.pdf")
I am running this on an AWS EC2 m4.4xlarge instance and it is pegging at 100% of the CPUs and only 1% of the memory.
I am skeptical of that, since I thought networkx was memory intensive, not a CPU hog. Right now, it is spinning on the nx.draw command. Is there any way I can monitor how far into the graph generation it is?
Upvotes: 4
Views: 5459
Reputation: 2368
NetworkX's draw will indeed take a long time. However, it is not the only layout/drawing function available through NetworkX, and your graph is not that big.
You could try draw_graphviz with something as simple as networkx.draw_graphviz(G, 'dot') or networkx.draw_graphviz(G, 'neato') (where G is your networkx graph).
This call uses graphviz for the node layout and matplotlib for the actual drawing, so make sure the machine also has graphviz installed (sudo apt-get install graphviz, sudo pip install pygraphviz, assuming you are running a Debian-based operating system where apt and pip are available).
For an explanation of what dot and neato mean, please see graphviz's website. They are two of the programs provided by graphviz that handle the layout and drawing of graphs (and they can also be called at the command line). I have personally used them with hundreds of thousands of edges on Amazon's EC2, and while the node layout might appear to take some time, they will produce an output.
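Since dot and neato can be called at the command line, one way to try them without going through matplotlib at all is to convert the edge list to a DOT file and hand it to graphviz directly. The following is only a rough sketch under a few assumptions: the helper names edgelist_to_dot and render are made up for illustration, the file names are the ones from the question, node names are assumed not to end in a backslash, and the chosen graphviz program is assumed to be on the PATH.

```python
import csv
import subprocess


def edgelist_to_dot(edge_path, dot_path):
    """Convert a two-column, comma-separated edge list into a DOT file.

    Returns the number of edges written. (Hypothetical helper, not part
    of networkx or graphviz.)
    """
    edges = 0
    with open(edge_path, newline="") as src, open(dot_path, "w") as dst:
        dst.write("graph G {\n")
        # Small, unlabeled points keep a very large drawing manageable.
        dst.write('  node [shape=point, width=0.02, label=""];\n')
        for row in csv.reader(src):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            # Escape embedded double quotes so the DOT identifiers stay valid.
            u, v = (name.strip().replace('"', '\\"') for name in row[:2])
            dst.write(f'  "{u}" -- "{v}";\n')
            edges += 1
        dst.write("}\n")
    return edges


def render(dot_path, out_path, prog="neato"):
    """Run a graphviz layout program; -T picks the format, -o the output file."""
    subprocess.run([prog, "-Tpdf", dot_path, "-o", out_path], check=True)
```

With the question's data this would be called as edgelist_to_dot('dataFile.txt', 'graph.dot') followed by render('graph.dot', 'graph.pdf'). Keeping the conversion and the rendering separate means the DOT file can be reused with a different layout program without re-reading the CSV.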
In terms of monitoring the whole process, you can issue a top command from a(nother) terminal and check what the process is doing. That will answer simple questions such as "Has the process stopped?", "Does it keep consuming memory?" and "What percentage of the CPU time is it using right now?", but it will not answer questions such as "What percentage of the graph has been laid out and drawn so far?". For more information about top, please see this link.
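The same coarse information that top gives can also be printed from inside the script. Below is a minimal sketch, assuming a Unix-like system (the resource module is not available on Windows); run_with_heartbeat is a hypothetical helper name, not something networkx provides. Like top, it only shows that the process is alive and how much CPU time and memory it has used, not how far the layout has progressed.

```python
import resource
import threading
import time


def run_with_heartbeat(work, interval=5.0, report=print):
    """Run work() while a background thread periodically reports usage.

    Hypothetical helper: reports wall-clock time, CPU time and peak
    resident memory of the current process every `interval` seconds.
    """
    done = threading.Event()
    start = time.monotonic()

    def heartbeat():
        # wait() returns False on timeout and True once work() has finished.
        while not done.wait(interval):
            usage = resource.getrusage(resource.RUSAGE_SELF)
            cpu = usage.ru_utime + usage.ru_stime
            # ru_maxrss is in kilobytes on Linux (bytes on macOS).
            report(
                f"elapsed {time.monotonic() - start:.0f}s, "
                f"cpu {cpu:.0f}s, peak RSS {usage.ru_maxrss}"
            )

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        return work()
    finally:
        done.set()
        thread.join()
```

For the code in the question, the slow call would be wrapped as run_with_heartbeat(lambda: nx.draw(G)). If the wrapped call spends a long time inside a single C routine that holds the interpreter lock, the reports may be delayed until it returns.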
Hope this helps.
Upvotes: 3
Reputation: 9818
Networkx is really not suited for the task. It is very slow. In addition, matplotlib (nx.draw) will never manage to draw that many objects.
If you want to visualize the graph, you will need a tool that shows each step of the layout and lets you adjust what is going on as it runs.
Even though it is buggy, I would recommend Gephi for this. The only layout algorithm that works for large graphs is OpenOrd (available as a Gephi plug-in). Remember to hide the edges while the algorithm is running.
As a general-purpose library to handle graphs of your scale, I would recommend graph-tool. With a C++ backend and a Python interface, it is much faster than networkx. The drawing is also better.
Finally, once you reach the million-node scale, you can switch to large graph-analytics frameworks such as Graphlab-Create or Apache GraphX.
Upvotes: 8