Avinash
Avinash

Reputation: 13277

huge Graph storage issue

I want to understand how to store a graph with huge data. I am designing an application which has a graph of huge railway route network. Where vertices are the railway station name. I have designed using adjacency list in C++. But now i found that it is consuming very high memory and sometime i also get no-memory error. I was wondering how such huge graph are stored so that algorithm on the graph can be used.

Graph is defined as

std::map<std::string, std::set<std::string> > railway_graph;

or how does google/facebook store there graph data structure.

Upvotes: 3

Views: 1012

Answers (5)

FUD
FUD

Reputation: 5184

Take a look at

http://redis.io/

what i would propose is to take your map and convert it to a map in redis, which can then be persisted on local file system. Look up is really fast and should not hurt performance a lot.

Upvotes: 0

Frunsi
Frunsi

Reputation: 7157

Your choice of data structure will require a lot superfluous memory, dynamically allocated on the heap. std::map and std::string will allocate a piece of memory for each single entry (plus its own overhead). std::string will also allocate a piece of memory for the string.

This is comfortable and totally ok for many cases. But not ok for large data structures.

In the end you have a map, which contains pointers (which itself were allocated one by one) to sets, which contains pointers (which itself were allocated one by one) to strings, which contain pointers to the actual string buffers.

Your actual problem is the overhead that dynamic allocation incurs. On most platforms, a heap allocation requires an extra 16-byte of memory just for heap management (though the numbers vary...).

I suggest, that you re-define your graph in the following way:

// a list of node names, its index (a size_t) is used in the following data structures
// - alternatively, you may use an std::map<int,std::string> here, to simplify the
//   "index" to "name" lookup...
typedef size_t NodeId;
typedef std::vector<std::string> NodeList;

// an edge
typedef std::pair<NodeId,NodeId> Edge;
// or alternatively:
struct Edge {
    NodeId from, to;
};

// a plain list of edges
typedef std::vector<Edge> EdgeList;

Or, alternatively the following data structures may be easier for your use cases. It is similar to your example, but is much more compact in memory representation:

// a list of node names, its index (a size_t) is used in the following data structures
typedef size_t NodeId;
typedef std::vector<std::string> NodeList;

typedef std::vector<NodeId> NodeIdList;

// a map from one node to its adjacent nodes
typedef std::map< NodeId, NodeIdList > Graph;

EDIT: Added and used NodeIdList ...

If this still consumes too much memory, then you should think about keeping data on disk and loading it on demand.

If your node names are constant, then you should also think about some kind of string-table, a more compact representation of string data in memory. But this is rather low-level stuff.

Try to use better data structures first!

Upvotes: 0

Pratik Deoghare
Pratik Deoghare

Reputation: 37222

class Node
{
   string id; 
   Data data; // fetch data by ID when required from some database 
}

You can store data related to each station in a database and fetch it by id as and when required.

Graph density is defined as D = 2|E|/(|V|(|V|-1)). You have to design data structure depending upon D.

If you have dense graph then you can use matrix representation. You will require only |V|*|V| bits approximately.

For sparse graph edge list representation is good.

Upvotes: 0

Bowie Owens
Bowie Owens

Reputation: 3006

Using map and string like that will add a lot of redundant memory use. If you store the names in one vector and the adjacency list using just integer indices it should be a lot more compact.

std::vector<std::string> name;
std::vector<std::vector<size_t> > adj_list;

Upvotes: 0

Z&#233;ychin
Z&#233;ychin

Reputation: 4205

Using an adjacency matrix representation instead of an adjacency list can reduce memory allocation for dense matrices.

Because you didn't mention what the size of the system is or what types of algorithms you are attempting to run, it is hard to judge whether your algorithm needs to be checked for inappropriate memory consumption, or if you actually need to make use of files as intermittent "memory" throughout your program in order to make the calculation possible.

Upvotes: 1

Related Questions