Reputation: 2296
I'm developing an application in which I need a structure to represent a huge graph (between 1,000,000 and 6,000,000 nodes, with 100 to 600 edges per node) in memory. The edge representation will contain some attributes of the relation.
I have tried a memory-mapped representation, arrays, dictionaries, and strings to represent that structure in memory, but these always crash because of the memory limit.
I would like advice on how I can represent this, or something similar.
By the way, I'm using python.
Upvotes: 11
Views: 7921
Reputation: 3139
The scipy.sparse.csgraph package may be able to handle this -- 5 million nodes * 100 edges on average is 500 million pairs, and at 8 bytes per pair (two integer IDs) that's just about 4GB. I think csgraph uses a compressed representation, so it will use less memory than that; this could work on your laptop.
csgraph doesn't have as many features as networkx, but it uses way less memory.
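For illustration, here is a minimal sketch of that approach on a toy graph (the node ids and weights are made up; for millions of nodes you would build the index arrays from your edge data instead):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

rows = np.array([0, 0, 1, 2])             # source node ids
cols = np.array([1, 2, 2, 3])             # destination node ids
weights = np.array([1.0, 4.0, 2.0, 1.0])  # one numeric edge attribute

# CSR stores the graph in a few compact arrays instead of Python objects
graph = csr_matrix((weights, (rows, cols)), shape=(4, 4))

# shortest-path distances from node 0 to every other node
dist = dijkstra(graph, indices=0)
print(dist)  # [ 0.  1.  3.  4.]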
Upvotes: 3
Reputation: 18237
If the only reason you need this in memory is that you need to be able to read and write it fast, then use a database. Databases read and write extremely fast; often they can serve reads without going to disk at all.
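As a minimal sketch of that idea with the standard-library sqlite3 module (the table layout here is an assumption, not something from the answer):

import sqlite3

conn = sqlite3.connect('graph.db')
conn.execute('CREATE TABLE IF NOT EXISTS edges (src INTEGER, dst INTEGER, weight REAL)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_src ON edges (src)')  # fast neighbour lookups
conn.execute('INSERT INTO edges VALUES (?, ?, ?)', (1, 2, 0.5))
conn.commit()

# fetch the neighbours of one node without loading the whole graph
for src, dst, weight in conn.execute('SELECT * FROM edges WHERE src = ?', (1,)):
    print(dst, weight)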
Upvotes: 14
Reputation: 13927
Depending on your hardware resources, an all-in-memory representation for a graph this size is probably out of the question. Two possible options from a graph-specific DB point of view are:
Since you're using Python, have you looked at NetworkX? Out of interest, if you have looked at it, how far did you get loading a graph of this size?
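For reference, a minimal sketch of a NetworkX graph with edge attributes; NetworkX keeps everything in nested Python dicts, and that per-node and per-edge overhead is what tends to exhaust memory at this scale:

import networkx as nx

G = nx.DiGraph()
G.add_edge(1, 2, weight=0.5)   # edge attributes live in a per-edge dict
G.add_edge(1, 3, weight=1.5)

print(G[1][2]['weight'])       # 0.5
print(list(G.successors(1)))   # [2, 3]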
Upvotes: 6
Reputation: 7061
Assuming you mean 600 edges per node, you could try something like this:
import os.path
import cPickle

class LazyGraph:
    def __init__(self, folder):
        self.folder = folder

    def get_node(self, id):
        f = open(os.path.join(self.folder, str(id)), 'rb')
        node = cPickle.load(f)
        f.close()  # just being paranoid
        return node

    def set_node(self, id, node):
        f = open(os.path.join(self.folder, str(id)), 'wb')
        cPickle.dump(node, f, -1)  # use highest protocol
        f.close()  # just being paranoid
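Hypothetical usage, assuming the folder already exists and each node is stored as a plain dict of neighbours and attributes (the payload shape is made up):

g = LazyGraph('./graph_nodes')
g.set_node(42, {'neighbours': [1, 2, 3], 'attrs': {'colour': 'red'}})
node = g.get_node(42)
print(node['neighbours'])  # [1, 2, 3]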
Use arrays (or numpy arrays) to hold the actual node ids, as they are faster.
Note: this will be very, very slow.
You could use threading to pre-fetch nodes (assuming you knew which order you were processing them in), but it won't be fun.
Upvotes: 1
Reputation: 6794
If you do decide to use some kind of database after all, I suggest looking at neo4j and its python bindings. It's a graph database capable of handling large graphs. Here's a presentation from this year's PyCon.
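As a rough sketch of the idea using the current official neo4j Python driver (which may not be the bindings the answer refers to; the URI, credentials, and labels are all assumptions):

from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
with driver.session() as session:
    # create two nodes and an edge carrying an attribute
    session.run('CREATE (:Node {id: $s})-[:LINKS {w: $w}]->(:Node {id: $d})',
                s=1, d=2, w=0.5)
    # read the edge back
    for record in session.run('MATCH (a:Node {id: $s})-[r:LINKS]->(b) '
                              'RETURN b.id AS dst, r.w AS w', s=1):
        print(record['dst'], record['w'])
driver.close()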
Upvotes: 0
Reputation: 9426
Sounds like you need a database and an iterator over the results. Then you wouldn't have to keep it all in memory at the same time, but you could always have access to it.
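A minimal generator sketch of that pattern; it assumes a SQLite connection and an edges(src, dst, weight) table, and all names are hypothetical:

def iter_edges(conn, batch=10000):
    # stream edges in batches so only one batch is in memory at a time
    cur = conn.execute('SELECT src, dst, weight FROM edges')
    while True:
        rows = cur.fetchmany(batch)
        if not rows:
            return
        for row in rows:
            yield row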
Upvotes: 0
Reputation: 55009
You appear to have very few edges considering the number of nodes, which suggests that most of the nodes aren't strictly necessary. So, instead of actually storing all of the nodes, why not use a sparse structure and only insert them when they're in use? This should be pretty easy to do with a dictionary; just don't insert a node until you use it for an edge.
The edges can be stored using an adjacency list on the nodes.
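A minimal sketch of that lazy approach with a plain dictionary (the names are illustrative):

from collections import defaultdict

adjacency = defaultdict(dict)  # node id -> {neighbour id: edge attributes}

def add_edge(src, dst, **attrs):
    # nodes only come into existence when an edge first touches them
    adjacency[src][dst] = attrs

add_edge(10, 42, weight=0.5)
print(adjacency[10])  # {42: {'weight': 0.5}}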
Of course, this only applies if you really mean 100-600 nodes in total. If you mean per node, that's a completely different story.
Upvotes: 3
Reputation: 89661
I doubt you'll be able to use an in-memory structure unless you have a LOT of memory at your disposal:
Assume you are talking about 600 directed edges from each node, with a node being 4 bytes (an integer key) and each directed edge being JUST a destination node key (4 bytes).
Then the raw data about each node is 4 + 600 * 4 = 2404 bytes, times 6,000,000 nodes, which is over 14.4GB.
That's without any other overheads or any additional data in the nodes (or edges).
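A quick back-of-the-envelope check of that arithmetic in Python:

bytes_per_node = 4 + 600 * 4        # node key plus 600 destination keys
total = bytes_per_node * 6000000
print(bytes_per_node, total / 1e9)  # 2404 bytes per node, ~14.4 GB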
Upvotes: 4