Reputation: 3928
I need to display a list of posts. For each post, I need to also show:
I don't know how to do it without querying for each item in a for loop, which is proving to be very slow. Sure caching/denormalization will help, but I'd like to know if this can be done otherwise. How does facebook do it?
Assuming this basic db structure, any suggestions?
users
-----
id
username
posts
---------
id
user_id
content
friendships
-----------
user_id
friend_id
is_confirmed (bool)
users_liked_posts
-----------------
user_id
post_id
As a side note, if anyone knows how to do this in SQLAlchemy, that would very much appreciated.
EDIT: SQLFiddle http://sqlfiddle.com/#!2/9e703
Upvotes: 0
Views: 157
Reputation: 24000
@koriander gave the SQL answer, but as to how Facebook does it, you already partially answered that; they use highly denormalized data, and caching. Also, they implement atomic counters, in-memory edge lists to perform graph traversals, and they most certainly don't use relational database concepts (like JOIN's) since they don't scale. Even the MySQL clusters they run are essentially just key/value pairs which only get accessed when there's a miss in the cache layer.
Instead of an RDBS, I might suggest a graph database for your purposes, like neo4j
Good luck.
EDIT:
You're really going to have to play with Neo4j if you're interested in using it. You may or may not find it easier coming from a SQL background, but it will certainly provide more powerful, and likely faster, queries for performing graph traversals.
Here's a couple examples of Cypher queries which may be useful to you.
Count how many people like a post:
START post=node({postId})
MATCH post<-[:like]-user
RETURN count(*)
(really you should use an atomic counter, instead, if it's something you're going to be querying for a lot)
Get three people who liked a post with the following constraints:
likingUser
will always be the current user if he/she liked the post.START post=node({postId}), user=node({currentUserId}) MATCH path = post<-[:like]-likingUser-[r?:friend*0..1]-user RETURN likingUser, count(r) as rc, length(path) as len ORDER BY rc desc, len asc LIMIT 3
I'll try to explain the above query... if I can.
post
and the current user
likingUser
)likingUser
through a friendship relationship to the current user
(a path of length 0 indicates that likingUser==user
).r
exists (it will exist if the likingUser
is friends with user
or if likingUser==user
). So, count(r)
will be either 0 or 1 for each result. Since we prefer results where count(r)==1
, we'll sort this in descending order.user
to the top of the list if he/she was part of the results set. We do this by checking the length of path
. When user==likingUser
, the path length will be shorter than when user
is a friend of likingUser
, so we can use length(path)
to force user
up to the top by sorting in ascending order.Hopefully that makes some sense. As a side note, you may actually get better performance by separating out your queries. For example, one query to see if the user likes the post, then another to get up to three friends who liked the post, and finally another to get up to three non-friends who like the post. I say it may be faster because each query can short-circuit after it gets three results, whereas the big single-query I wrote has to consider all possibilities, then sort them. So, just keep in mind that just because you can combine multiple questions into a single query, it may actually perform worse than multiple queries.
Upvotes: 1
Reputation: 3258
You can try this in your sqlfiddle. The condition "WHERE user_id = 2" needs 2 replaced by your current user id.
SELECT numbered.*
FROM
(SELECT ranked.*,
IF (post_id=@prev_post,
@n := @n + 1,
@n := 1 AND @prev_post := post_id) as position
FROM
(SELECT users_liked_posts.post_id,
users_liked_posts.user_id,
visitor.user_id as u1,
friendships.user_id as u2,
IF (visitor.user_id is not null, 1, IF(friendships.user_id is not null, 2, 3)) as rank
FROM users_liked_posts
INNER JOIN posts
ON posts.id = users_liked_posts.post_id
LEFT JOIN friendships
ON users_liked_posts.user_id = friendships.user_id
AND friendships.friend_id = posts.user_id
LEFT JOIN (SELECT post_id, user_id FROM users_liked_posts WHERE user_id = 2) visitor
ON users_liked_posts.post_id = visitor.post_id
AND users_liked_posts.user_id = visitor.user_id
ORDER BY users_liked_posts.post_id, rank) as ranked
JOIN
(SELECT @n := 0, @prev_post := 0) as setup) as numbered
WHERE numbered.position < 4
You can easily join subquery "numbered" with table "users" to obtain additional user information. There are extra fields u2, u3 to help see what is happening. You can remove these.
General idea of the query:
1) left join users_liked_posts with itself two times. The first time it is restricted to current visitor, creating subquery visitors. The second time is restricted to friends.
2) the column rank, IF (visitor.user_id is not null, 1, IF(friendships.user_id is not null, 2, 3)), assigns a rank to each user in users_liked_posts. This query is sorted by post and by rank.
3) use the previous as a subquery to create the same data but with a running position for the users, per post.
4) use the previous as a subquery to extract the top 3 positions per post.
No, these steps can not be merged, in particular because MySQL does not allow a computed column to be used by alias in the WHERE condition.
Upvotes: 2