Is a database query faster than algorithms for finding LinkedIn type 2nd degree connections on one server?

I have a table with friends id, u1, u2 and about < 500,000 entries on a single mysql server

and i want to take userA and userB and check whether they have any mutual friends.

Is it faster to do

select u2 from friends where u1 = userA and u2 IN (select u2 from friends where u1 = userB)

than to run a shortest path algorithm on a graph (on one server)?

What is the standard way, big networks like LinkedIn and Facebook use to handle this?

Thanks!

Upvotes: 4

Answers (5)

Matthias Broecheler

Reputation: 1211

In a graph database you could write your query in gremlin as:

g.V('username','userB').out('friend').retain(g.V('username','userA').out('friend').gather)

Most graph databases should execute this quickly.

If you use Titan, you can additionally exploit that Titan maintains adjacent vertices in sort order, which mean that you can compute the intersection of the two friend lists using only one iteration over the data and without creating additional data structures. This will likely be faster than MySQL and a lot faster if the average number of friends is large.

Upvotes: 2

didierc

Reputation: 14730

Here's another take for second degree connections using a simple inner join:

select fA.u2 
from friends fA 
inner join friends fB on
           fA.u2 = fB.u2 
where fA.u1 = userA and
      fB.u1 = userB

This is the same approach as a many to many type query. You don't need to use a shortest path for that level of relationship.

If you wish to look for larger degree relationships, then you should look into adjacency lists, but it's not easy to implement it using MySQL. There are some issues to really look after in that setup:

disjoint graphs (could be handled by maintaining transitive closures on subgraphs, and merge them when required),
directed vs. undirected graph,
data distribution (another answer mentioned hadoop as a way to accelerate processing, but it requires a good partition scheme)

to name a few.

Upvotes: 0

Gordon Linoff

Reputation: 1269793

In MySQL, the query that you wrote will be slower than any other way of finding this information. Perhaps slower than asking each person individually. Your query:

select u2
from friends
where u1 = userA and
      u2 IN (select u2 from friends where u1 = userB)

Has a subquery in the IN clause. MySQL evaluates the query for every row encountered. A better way to write this is:

select u2
from friends
where u1 = userA and
      exists (select 1 from friends where u1 = userB limit 1)

If your data all fits on one server and fits into memory, the performance of an optimized MySQL query should be fine. Sites such as LinkedIn and FaceBook are dealing with a myriad of issues -- constant updates to the network, vastly larger amounts of data, different types of links, and so on. Your simple example is not representative of what they are doing. But, many of their analyses use Hadoop or Hadoop in conjunction with relational databases.

Upvotes: 2

Alexei Kaigorodov

Reputation: 13525

If the table friends is indexed by both u1 and u2, then SQL query is to take intersection of 2 subsets and is pretty fast. It is because indexing is already done. If you do computations in memory, time depends on whether you have prebuilt indexes: if you have, you'll be faster because of no database connection overhead. If indexing is included in computational time, and database is warmed (all data in memory), you can lost.

I'm talking on indexing, not shortest path algorithm, because shortest path algorithm computes more data than you need.

Upvotes: 2

Karussell

Reputation: 17375

You really have to try this out and compare on your own data. Have a look into cassovary, flockdb, neo4j etc

Personally I would do it in-memory as you have not that many entries. E.g. try out a BitSet where you can use fast bit-operations (AND).

Upvotes: 0

Is a database query faster than algorithms for finding LinkedIn type 2nd degree connections on one server?

Answers (5)

Related Questions