Graph Data Structures with millions of nodes (Social network)

Question

In the context of design of a social network using Graphs data structure, where you can perform a BFS to find a connection from one person to another, I have some questions pertaining to it.

If there are million users, the topology would indeed be much more complicated and interconnected than the graphs we normally design and I am trying to comprehend how you could solve these problems.

In the real world, servers fail. How does this affect you?
How could you take advantage of caching?
Do you search until the end of the graph (infinite)? How do you decide when to give up?
In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where you start traverse?

Salvatore Previti · Accepted Answer

Your question seems interesting and curious :)

1) Well... of course, data is stored in disks, not in ram. Disks have systems that avoid failure, in particular, RAID-5 for example. Redundancy is the key: if one system fail there is another system ready to take his place. There is also redundancy and workload sharing together... there are two computers that work in parallel and share their jobs but if one stops only one works and take the full workload.

In places like google or facebook redundancy is not 2, is 1200000000 :) And consider also that data is not in a single server farm, in google there are several datacenters connected together, so if one building explodes, another one will take his place for example.

2) Not an easy question at all, but usually these systems have big cache for disk arrays too, so reading and writing data on disk is faster than on our laptops :) Data can be processed in parallel by several concurrent systems and this is the key of the speed of services like facebook.

3) The end of the graph is not infinite. So it is possible with actual technology indeed.

The computational complexity of exploring all connections and all nodes on a graph is O(n + m) where n is the number of vertices and m the number of edges. This means, it is linear to the number of registered user and to the number of connection between users. And RAM these days is very cheap.

Being a linear growth is easy to add resources when needed. Add more computers the more you get rich :)

Consider also that no-one will perform a real search for every node, everything in facebook is quite "local", you can view the direct friend of one person, not the friend of friend of friend .... it would be not useful.

Getting the number of vertices directly connected to a vertex, if the data structure is well done, is very easy and fast. In SQL it would be a simple select and if tables are well indexed it will be very fast and also not very dependant on the total number of users (see the concept of hash tables).

Graph Data Structures with millions of nodes (Social network)

Answers (1)

Related Questions