Reputation: 13
Just for fun, I am building a Twitter clone to get a better understanding of C* (Cassandra).
All the suggested C* schemas that I have seen use more or less the same modeling technique. My concern is that I doubt modeling the Twitter timeline this way will scale.
The problem: What happens if userA (a rock star), or several such users, is extremely popular and followed by 10k+ users? Each time userA publishes a tweet, we have to insert a copy of it into the timeline table for each of those 10k+ followers.
Questions: Will this model really scale? Can anyone suggest an alternative way of modeling the timeline that really scales?
C* Schema:
CREATE TABLE users (
uname text, -- UserA
followers set<text>, -- Users who follow userA
following set<text>, -- Users whom userA follows
PRIMARY KEY (uname)
);
-- View of tweets created by user
CREATE TABLE userline (
tweetid timeuuid,
uname text,
body text,
PRIMARY KEY(uname, tweetid)
);
-- View of tweets created by user, and users he/she follows
CREATE TABLE timeline (
uname text,
tweetid timeuuid,
posted_by text,
body text,
PRIMARY KEY(uname, tweetid)
);
-- Example of userA posting a tweet (pseudocode: CQL plus an application-side loop).
-- :tweetid is a timeuuid generated once by the application and bound to every
-- statement; calling now() in each INSERT would yield a different id per row.
-- BATCH START
-- Store the tweet in the tweets table (its definition is not shown above)
INSERT INTO tweets (tweetid, uname, body) VALUES (:tweetid, 'userA', 'Test tweet #1');
-- Store the tweet in this user's userline
INSERT INTO userline (uname, tweetid, body) VALUES ('userA', :tweetid, 'Test tweet #1');
-- Store the tweet in this user's timeline
INSERT INTO timeline (uname, tweetid, posted_by, body) VALUES ('userA', :tweetid, 'userA', 'Test tweet #1');
-- Store the tweet in the public timeline
INSERT INTO timeline (uname, tweetid, posted_by, body) VALUES ('#PUBLIC', :tweetid, 'userA', 'Test tweet #1');
-- Insert the tweet into follower timelines
-- findUserFollowers = SELECT followers FROM users WHERE uname = 'userA';
for (String follower : findUserFollowers("userA")) {
INSERT INTO timeline (uname, tweetid, posted_by, body) VALUES (:follower, :tweetid, 'userA', 'Test tweet #1');
}
-- BATCH END
Thanks in advance for any suggestions.
Upvotes: 0
Views: 162
Reputation: 3514
In my opinion, the schema you outlined, or a similar one, is the best fit for the use case (show the latest tweets from the users that user X follows, plus X's own tweets).
There are two gotchas, however.
First, I don't think Twitter uses Cassandra for storing tweets, probably for the same reasons you're starting to think about. The feed doesn't seem like a great fit for Cassandra: you don't want to persist countless copies of other people's tweets forever, but rather keep a sliding window updated for each user (I'm guessing most users don't read thousands of tweets down from the top of their feed). So we're talking about a queue, and one that in some cases is updated essentially in real time. Cassandra can only support this pattern at the far end of scale with some coercion; I don't think it was designed for massive churn.
In production another database with better support for queues would probably be picked--maybe something like sharded Redis with its list support.
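To make the "sliding window" idea concrete, here is a minimal in-memory sketch of a capped per-user feed. It only imitates what a Redis list would give you with LPUSH followed by LTRIM; the names `WINDOW`, `feeds` and `push` are made up for illustration, and a `deque` stands in for the Redis list.

```python
from collections import defaultdict, deque

WINDOW = 3  # hypothetical cap on how many tweets each feed keeps

# uname -> newest-first feed; deque(maxlen=...) imitates LPUSH + LTRIM on a Redis list
feeds = defaultdict(lambda: deque(maxlen=WINDOW))


def push(uname, tweet):
    """Add a tweet to the front of the feed; once full, the oldest entry is dropped."""
    feeds[uname].appendleft(tweet)


for n in range(1, 6):
    push("bob", f"tweet {n}")

latest = list(feeds["bob"])  # only the WINDOW most recent tweets remain, newest first
```

The point is that old entries disappear as a side effect of writing new ones, so nothing accumulates forever and no separate cleanup pass is needed.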
Second, for the example you gave, the problem is not as bad as it may seem, because you don't need to do this update in a synchronous batch. You can post to the author's own lists, return quickly, and then do all the other updates with an asynchronous worker running in the cluster that pushes out updates with best-effort QoS.
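A rough sketch of that split, assuming plain dictionaries in place of the Cassandra tables and a local `queue.Queue` in place of whatever job queue you would really use. The names `post_tweet`, `fanout_worker` and `fanout_jobs` are hypothetical, and integer tweet ids stand in for timeuuids.

```python
import queue
import threading
from collections import defaultdict

# In-memory stand-ins for the tables (assumption: real code would issue CQL INSERTs)
followers = {"userA": {"bob", "carol", "dave"}}
userline = defaultdict(list)   # uname -> [(tweet_id, body)]
timeline = defaultdict(list)   # uname -> [(tweet_id, posted_by, body)]

fanout_jobs = queue.Queue()    # hypothetical queue between the request path and the worker


def post_tweet(author, tweet_id, body):
    """Synchronous part: write only the author's own rows, then enqueue the fan-out."""
    userline[author].append((tweet_id, body))
    timeline[author].append((tweet_id, author, body))
    fanout_jobs.put((author, tweet_id, body))


def fanout_worker():
    """Asynchronous part: copy each queued tweet into every follower's timeline."""
    while True:
        job = fanout_jobs.get()
        if job is None:              # sentinel value: stop the worker
            fanout_jobs.task_done()
            break
        author, tweet_id, body = job
        for follower in followers.get(author, ()):
            timeline[follower].append((tweet_id, author, body))
        fanout_jobs.task_done()


worker = threading.Thread(target=fanout_worker, daemon=True)
worker.start()

post_tweet("userA", 1, "Test tweet #1")  # returns without waiting for the fan-out
fanout_jobs.join()                        # demo only: wait until the worker has caught up
fanout_jobs.put(None)
worker.join()
```

The author sees the tweet immediately, while followers see it a little later. With 10k+ followers the cost is still 10k+ writes, but it no longer sits on the request path.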
Finally, since you've asked about alternatives, here is a variation that I could think of. It may be conceptually closer to the queue I mentioned, but under the hood it will run into a lot of the same problems related to heavy data churn.
CREATE TABLE users(
uname text,
mru_timeline_slot int,
followers set<text>,
following set<text>,
PRIMARY KEY (uname)
);
// circular buffer: keep at most X slots for every user.
CREATE TABLE timeline_most_recent(
uname text,
timeline_slot int,
tweeted timeuuid,
posted_by text,
body text,
PRIMARY KEY(uname, timeline_slot)
);
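Here is how the slot arithmetic for this circular buffer could work, sketched in memory with dictionaries standing in for the two tables. `TIMELINE_SLOTS` plays the role of X, and `push_to_timeline` and `read_timeline` are hypothetical helpers, not part of any driver.

```python
TIMELINE_SLOTS = 3  # "X" from the schema comment; an arbitrary value for this demo

# In-memory stand-ins for the two tables (assumption: real code would use CQL)
users = {"bob": {"mru_timeline_slot": -1}}   # -1 means no slot has been written yet
timeline_most_recent = {}                     # (uname, timeline_slot) -> row


def push_to_timeline(uname, tweeted, posted_by, body):
    """Overwrite the oldest slot and record it as the most recently used one."""
    slot = (users[uname]["mru_timeline_slot"] + 1) % TIMELINE_SLOTS
    timeline_most_recent[(uname, slot)] = {
        "tweeted": tweeted,
        "posted_by": posted_by,
        "body": body,
    }
    users[uname]["mru_timeline_slot"] = slot


def read_timeline(uname):
    """Return rows newest first by walking backwards from the most recent slot."""
    mru = users[uname]["mru_timeline_slot"]
    rows = []
    for i in range(TIMELINE_SLOTS):
        row = timeline_most_recent.get((uname, (mru - i) % TIMELINE_SLOTS))
        if row is not None:
            rows.append(row)
    return rows


for n in range(1, 6):
    push_to_timeline("bob", n, "userA", f"Test tweet #{n}")

bodies = [row["body"] for row in read_timeline("bob")]
```

Note that the read-then-write on `mru_timeline_slot` is not atomic, so against a real cluster two concurrent writers could pick the same slot. You would need some way to serialize writes per user, which is part of why this variation runs into similar trouble.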
Upvotes: 0