Reputation: 13277
I want to understand how to store a graph with huge data. I am designing an application which has a graph of huge railway route network. Where vertices
are the railway station name
. I have designed using adjacency list
in C++
. But now i found that it is consuming very high memory and sometime i also get no-memory
error. I was wondering how such huge graph are stored so that algorithm
on the graph can be used.
Graph is defined as
std::map<std::string, std::set<std::string> > railway_graph;
or how does google/facebook store there graph data structure.
Upvotes: 3
Views: 1012
Reputation: 5184
Take a look at
http://redis.io/
what i would propose is to take your map and convert it to a map in redis, which can then be persisted on local file system. Look up is really fast and should not hurt performance a lot.
Upvotes: 0
Reputation: 7157
Your choice of data structure will require a lot superfluous memory, dynamically allocated on the heap. std::map
and std::string
will allocate a piece of memory for each single entry (plus its own overhead). std::string
will also allocate a piece of memory for the string.
This is comfortable and totally ok for many cases. But not ok for large data structures.
In the end you have a map, which contains pointers (which itself were allocated one by one) to sets, which contains pointers (which itself were allocated one by one) to strings, which contain pointers to the actual string buffers.
Your actual problem is the overhead that dynamic allocation incurs. On most platforms, a heap allocation requires an extra 16-byte of memory just for heap management (though the numbers vary...).
I suggest, that you re-define your graph in the following way:
// a list of node names, its index (a size_t) is used in the following data structures
// - alternatively, you may use an std::map<int,std::string> here, to simplify the
// "index" to "name" lookup...
typedef size_t NodeId;
typedef std::vector<std::string> NodeList;
// an edge
typedef std::pair<NodeId,NodeId> Edge;
// or alternatively:
struct Edge {
NodeId from, to;
};
// a plain list of edges
typedef std::vector<Edge> EdgeList;
Or, alternatively the following data structures may be easier for your use cases. It is similar to your example, but is much more compact in memory representation:
// a list of node names, its index (a size_t) is used in the following data structures
typedef size_t NodeId;
typedef std::vector<std::string> NodeList;
typedef std::vector<NodeId> NodeIdList;
// a map from one node to its adjacent nodes
typedef std::map< NodeId, NodeIdList > Graph;
EDIT: Added and used NodeIdList
...
If this still consumes too much memory, then you should think about keeping data on disk and loading it on demand.
If your node names are constant, then you should also think about some kind of string-table, a more compact representation of string data in memory. But this is rather low-level stuff.
Try to use better data structures first!
Upvotes: 0
Reputation: 37222
class Node
{
string id;
Data data; // fetch data by ID when required from some database
}
You can store data related to each station in a database and fetch it by id
as and when required.
Graph density is defined as D = 2|E|/(|V|(|V|-1))
. You have to design data structure depending upon D
.
If you have dense graph then you can use matrix representation. You will require only |V|*|V| bits approximately.
For sparse graph edge list representation is good.
Upvotes: 0
Reputation: 3006
Using map and string like that will add a lot of redundant memory use. If you store the names in one vector and the adjacency list using just integer indices it should be a lot more compact.
std::vector<std::string> name;
std::vector<std::vector<size_t> > adj_list;
Upvotes: 0
Reputation: 4205
Using an adjacency matrix representation instead of an adjacency list can reduce memory allocation for dense matrices.
Because you didn't mention what the size of the system is or what types of algorithms you are attempting to run, it is hard to judge whether your algorithm needs to be checked for inappropriate memory consumption, or if you actually need to make use of files as intermittent "memory" throughout your program in order to make the calculation possible.
Upvotes: 1