Reputation: 6789
I have a table with friends id, u1,
u2
and about < 500,000
entries on a single mysql server
and i want to take userA
and userB
and check whether they have any mutual friends.
Is it faster to do
select u2 from friends where u1 = userA and u2 IN (select u2 from friends where u1 = userB)
than to run a shortest path algorithm on a graph (on one server)?
What is the standard way, big networks like LinkedIn and Facebook use to handle this?
Thanks!
Upvotes: 4
Views: 996
Reputation: 1211
In a graph database you could write your query in gremlin as:
g.V('username','userB').out('friend').retain(g.V('username','userA').out('friend').gather)
Most graph databases should execute this quickly.
If you use Titan, you can additionally exploit that Titan maintains adjacent vertices in sort order, which mean that you can compute the intersection of the two friend lists using only one iteration over the data and without creating additional data structures. This will likely be faster than MySQL and a lot faster if the average number of friends is large.
Upvotes: 2
Reputation: 14730
Here's another take for second degree connections using a simple inner join
:
select fA.u2
from friends fA
inner join friends fB on
fA.u2 = fB.u2
where fA.u1 = userA and
fB.u1 = userB
This is the same approach as a many to many type query. You don't need to use a shortest path for that level of relationship.
If you wish to look for larger degree relationships, then you should look into adjacency lists, but it's not easy to implement it using MySQL. There are some issues to really look after in that setup:
to name a few.
Upvotes: 0
Reputation: 1269793
In MySQL, the query that you wrote will be slower than any other way of finding this information. Perhaps slower than asking each person individually. Your query:
select u2
from friends
where u1 = userA and
u2 IN (select u2 from friends where u1 = userB)
Has a subquery in the IN clause. MySQL evaluates the query for every row encountered. A better way to write this is:
select u2
from friends
where u1 = userA and
exists (select 1 from friends where u1 = userB limit 1)
If your data all fits on one server and fits into memory, the performance of an optimized MySQL query should be fine. Sites such as LinkedIn and FaceBook are dealing with a myriad of issues -- constant updates to the network, vastly larger amounts of data, different types of links, and so on. Your simple example is not representative of what they are doing. But, many of their analyses use Hadoop or Hadoop in conjunction with relational databases.
Upvotes: 2
Reputation: 13525
If the table friends is indexed by both u1 and u2, then SQL query is to take intersection of 2 subsets and is pretty fast. It is because indexing is already done. If you do computations in memory, time depends on whether you have prebuilt indexes: if you have, you'll be faster because of no database connection overhead. If indexing is included in computational time, and database is warmed (all data in memory), you can lost.
I'm talking on indexing, not shortest path algorithm, because shortest path algorithm computes more data than you need.
Upvotes: 2