fabi
fabi

Reputation: 474

Representation of a simple undirected graph

I need your expertise:

I am about to implement a graph class in c++ and thinking about the right representation. The graphs are simple and undirected. Number of vertices get for now just up to 1000 but maybe higher in the future. Number of edges up to 200k and maybe higher. Each vertex got a color (int) and an id (int). Edges transport no more information than connecting to vertices.

I store the graph and just need to access if x and y are connected or not - this I need very often. After initialising i never remove or add new vertices or edges (N = Number of vertices and M=number of edges given from the start)!

The one representation which is already available to me:

An adjacency list rolled out into just one long list. Along with this representation goes an array with starting indices for each vertex. Storage O(2M) and check if edge between x and y in an average of O(n/m)

A representation I thought of:

the idea is to, instead of rolling out the adjacency list into one array, do it with the adjacency matrix. So storage O(N^2)? Yes but I want to store an edge in one bit except of one byte.(actually 2 bits symmetricallywise) Example: Say N=8, then create an vector<uint8_t> of length 8 (64 bit). Init each entry on 0. If there is an edge between vertex 3 and vertex 5, then add pow(2,5) to the entry of my vector belonging to vertex 3 and symmetrically. So there is a 1 in the entry of vertex 3 at position of vertex 5 exactly when there is an edge between 3 and 5. After inserting my graph into this data structure I think one should be able to access neighborhood in constant time by just a binary operation: Are 3 and 5 connected? Yes if v[3] ^ pow(2,5) == 0. When there are more vertices than 8, then every vertex needs to get more than one entry in the vector and I need to perform one modulo and one division operation for accessing the correct spot.

What do you think of the second solution - is it maybe already known and in use? Am I wrong by thinking about an access of O(1)? Is it to much effort for no real performance improvement?

The reason for loading both representations in one big list is due to cache improvements I was told.

I am happy to get some feedback on this idea. I might be way off - pls be kind in that case :D

Upvotes: 1

Views: 1189

Answers (2)

Ivan86
Ivan86

Reputation: 5708

A 1000x1000 matrix with 200,000 edges will be quite sparse. Since the graph is undirected, the edges in the matrix will be written twice:

VerticeA -> VerticeB   and   VerticeB -> VerticeA

You will end up filling up 40% of the matrix, the rest will be empty.


Edges

The best approach I can think of here is to use a 2D vector of booleans:

std::vector<std::vector<bool>> matrix(1000, std::vector<bool>(1000, false));

The lookup will take O(1) time and std::vector<bool> saves space by using a single bit for each boolean value. You will end up using 1Mbit or ~128kB (125 kB) of memory.

The storage is not necessarily an array of bool values, but the library implementation may optimize storage so that each value is stored in a single bit.

This will allow you to check for an edge like this:

if( matrix[3][5] )
{
    // vertice 3 and 5 are connected
}
else
{
    // vertice 3 and 5 are not connected
}

Vertices

If the id values of the vertices form a continuous range of ints (e.g. 0,1,2,3,...,999) then you could store the color information in a std::vector<int> which has O(1) access time:

std::vector<int> colors(1000);

This would use up memory equal to:

1000 * sizeof(int) = 4000 B ~ 4 kB (3.9 kB)

On the other hand, if the id values don't form a continuous range of ints it might be a better idea to use a std::unordered_map<int, int> which will on average give you O(1) lookup time.

std::unordered_map<int, int> map;

So e.g. to store and look up the color of vertice 4:

map[4] = 5;            // assign color 5 to vertice 4
std::cout << map[4];   // prints 5

The amount of memory used by std::unordered_map<int, int> will be:

1000 * 2 * sizeof(int) = 8000 B ~ 8 kB (7.81 kB)

All together, for edges:

Type Memory Access time
std::vector<std::vector<bool>> 125 kB O(1)

and for vertices:

Type Memory Access time
std::vector<int> 3.9 kB O(1)
std::unordered_map<int, int> 7.8 kB O(1) on avg.

Upvotes: 2

Surt
Surt

Reputation: 16129

If you go for a bit matrix then the memory usage is O(V^2), so ~1Mb bits or 128KB, of which slightly less than half really are duplicates.

If you make an array of the edges O(E) and another array of index into the edges from the vertexes to the first of its edge you use 200K*sizeof(int) or 800KB which is much more, half of it is also duplicates (A-B and B-A are the same) which here actually could be saved. Same if you know (or can template you out of it) that the number of vertexes can be stored in an uint16_t half can be saved again.

To save half you just check which of the Vertexes has the lower number and checks its edges.

To find out when to stop looking you use the index on the next Vertex.

So with your numbers it is fine or even good to use a bit matrix.

The first problem comes when (V^2)/8 > (E*4) though the binary search in the Edge algorithm would still be much slower than checking a bit. That would occur if we set E = V * 200 (1000 Vertexes vs 200K edges)

V*V/8 > V*200*4
V/8 > 200*4
V > 200*4*8 = 6400

That would be 5120000 ~ 5MB easily fits into a L3 cache nowadays. If the connectivity (here average number of connections per vertex) is higher than 200 so much the better.

Checking the edges will also cost lg2(connectivity)*K(mispredicts) which gets rather steep. checking the bit matrix would be O(1).

You would need to measure, among others when the bit matrix breaks the L3 significantly while the Edge list still fits L3 and when it spills over in virtual memory.

In other words with a high connectivity the bit matrix should beat the Edge list with a much lower connectivity or much higher number of vertexes the Edge list might be faster.

Upvotes: 1

Related Questions