BDuelz
BDuelz

Reputation: 3928

How to do this in one select query?

I need to display a list of posts. For each post, I need to also show:

  1. How many people "like" the post.
  2. Three names of those who "like" the post (preferably friends of viewing user).
  3. If the viewing user "likes" the post, I'd like for him/her to be one of the three.

I don't know how to do it without querying for each item in a for loop, which is proving to be very slow. Sure caching/denormalization will help, but I'd like to know if this can be done otherwise. How does facebook do it?

Assuming this basic db structure, any suggestions?

users
-----
id
username

posts
---------
id
user_id
content

friendships
-----------
user_id
friend_id
is_confirmed (bool)

users_liked_posts
-----------------
user_id
post_id

As a side note, if anyone knows how to do this in SQLAlchemy, that would very much appreciated.

EDIT: SQLFiddle http://sqlfiddle.com/#!2/9e703

Upvotes: 0

Views: 157

Answers (2)

Bret Copeland
Bret Copeland

Reputation: 24000

@koriander gave the SQL answer, but as to how Facebook does it, you already partially answered that; they use highly denormalized data, and caching. Also, they implement atomic counters, in-memory edge lists to perform graph traversals, and they most certainly don't use relational database concepts (like JOIN's) since they don't scale. Even the MySQL clusters they run are essentially just key/value pairs which only get accessed when there's a miss in the cache layer.

Instead of an RDBS, I might suggest a graph database for your purposes, like neo4j

Good luck.

EDIT:

You're really going to have to play with Neo4j if you're interested in using it. You may or may not find it easier coming from a SQL background, but it will certainly provide more powerful, and likely faster, queries for performing graph traversals.

Here's a couple examples of Cypher queries which may be useful to you.

Count how many people like a post:

START post=node({postId})
MATCH post<-[:like]-user
RETURN count(*)

(really you should use an atomic counter, instead, if it's something you're going to be querying for a lot)

Get three people who liked a post with the following constraints:

  1. The first likingUser will always be the current user if he/she liked the post.
  2. If friends of the current user liked the post, they will show up before any non-friends.
START post=node({postId}), user=node({currentUserId}) 
MATCH path = post<-[:like]-likingUser-[r?:friend*0..1]-user 
RETURN likingUser, count(r) as rc, length(path) as len 
ORDER BY rc desc, len asc
LIMIT 3

I'll try to explain the above query... if I can.

  1. Start by grabbing two nodes, the post and the current user
  2. Match all users who like the post (likingUser)
  3. Additionally, test whether there is a path of length 0 or 1 which connects likingUser through a friendship relationship to the current user (a path of length 0 indicates that likingUser==user).
  4. Now, order the results first by whether or not relationship r exists (it will exist if the likingUser is friends with user or if likingUser==user). So, count(r) will be either 0 or 1 for each result. Since we prefer results where count(r)==1, we'll sort this in descending order.
  5. Next, perform a secondary sort which forces the current user to the top of the list if he/she was part of the results set. We do this by checking the length of path. When user==likingUser, the path length will be shorter than when user is a friend of likingUser, so we can use length(path) to force user up to the top by sorting in ascending order.
  6. Lastly, we limit the results to only the top three results.

Hopefully that makes some sense. As a side note, you may actually get better performance by separating out your queries. For example, one query to see if the user likes the post, then another to get up to three friends who liked the post, and finally another to get up to three non-friends who like the post. I say it may be faster because each query can short-circuit after it gets three results, whereas the big single-query I wrote has to consider all possibilities, then sort them. So, just keep in mind that just because you can combine multiple questions into a single query, it may actually perform worse than multiple queries.

Upvotes: 1

koriander
koriander

Reputation: 3258

You can try this in your sqlfiddle. The condition "WHERE user_id = 2" needs 2 replaced by your current user id.

SELECT numbered.*
FROM
 (SELECT ranked.*,
       IF (post_id=@prev_post,
           @n := @n + 1,
           @n := 1 AND @prev_post := post_id) as position
 FROM
   (SELECT users_liked_posts.post_id,
          users_liked_posts.user_id,
          visitor.user_id as u1,
          friendships.user_id as u2,
          IF (visitor.user_id is not null, 1, IF(friendships.user_id is not null, 2, 3)) as rank
   FROM   users_liked_posts
   INNER JOIN posts
   ON     posts.id = users_liked_posts.post_id
   LEFT JOIN friendships
   ON     users_liked_posts.user_id = friendships.user_id
   AND    friendships.friend_id = posts.user_id
   LEFT JOIN (SELECT post_id, user_id FROM users_liked_posts WHERE user_id = 2) visitor
   ON     users_liked_posts.post_id = visitor.post_id
   AND    users_liked_posts.user_id = visitor.user_id
   ORDER BY users_liked_posts.post_id, rank) as ranked
   JOIN
   (SELECT @n := 0, @prev_post := 0) as setup) as numbered
WHERE numbered.position < 4

You can easily join subquery "numbered" with table "users" to obtain additional user information. There are extra fields u2, u3 to help see what is happening. You can remove these.

General idea of the query:

1) left join users_liked_posts with itself two times. The first time it is restricted to current visitor, creating subquery visitors. The second time is restricted to friends.

2) the column rank, IF (visitor.user_id is not null, 1, IF(friendships.user_id is not null, 2, 3)), assigns a rank to each user in users_liked_posts. This query is sorted by post and by rank.

3) use the previous as a subquery to create the same data but with a running position for the users, per post.

4) use the previous as a subquery to extract the top 3 positions per post.

No, these steps can not be merged, in particular because MySQL does not allow a computed column to be used by alias in the WHERE condition.

Upvotes: 2

Related Questions