Reputation: 665
I am making a website and I want to store all users posts in one table ordered by the time they post it. the cassandra data model that I made is this
CREATE TABLE Posts(
ID uuid,
title text,
insertedTime timestamp,
postHour int,
contentURL text,
userID text,
PRIMARY KEY (postHour, insertedTime)
) WITH CLUSTERING ORDER BY (insertedTime DESC);
The question I'm facing is, when a user visits the posts page, it fetches the most recent ones by querying
SELECT * FROM Posts WHERE postHour = ?;
? = current hour
so far when the user scrolls down ajax requests are made to get more posts from the server. Javascript keeps track of postHour of the lastFetched item and sends back to the server along with the cassandra PagingState when requesting for new posts.
but this approach will query more than 1 partition when user scrolls down. I want to know whether this model would perform without a problem, is there any other model that I can follow.
Someone please point me in the right direction. Thank You.
Upvotes: 0
Views: 855
Reputation: 1385
That's a good start but a few pointers:
You'll probably need more than just the postHour
as the partition key. I'm guessing you don't want to store all the posts regardless of the day together and then page through them. What you're probably are after here is:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime)
But there's still a problem. Your PRIMARY KEY
has to uniquely identify a row (in this case a post). I'm going to guess it's possible, although not likely, that two users might make a post with the same insertedTime
value. What you really need then is to add the ID
to make sure they are unique:
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime, ID)
At this point, I'd consider just combining your ID
and insertedTime
columns into a single ID
column of type timeuuid
. With those changes, your final table looks like:
CREATE TABLE Posts(
ID timeuuid,
postYear int,
postMonth int,
postDay int,
postHour int,
title text,
contentURL text,
userID text,
PRIMARY KEY ((postYear, postMonth, postDay, postHour), ID)
) WITH CLUSTERING ORDER BY (ID DESC);
Whatever programming language you're using should have a way to generate a timeuuid
from the inserted time and then extract that time from a timeuuid
value if you want to show it in the UI or something. (Or you could use the CQL timeuuid functions for doing the converting.)
As to your question about querying multiple partitions, yes, that's totally fine to do, but you could run into trouble if you're not careful. For example, what happens if there is a 48 hour period with no posts? Do you have to issue 48 queries that return empty results before finally getting some back on your 49th query? (That's probably going to be really slow and a crappy user experience.)
There are a couple things you could do to try and mitigate that:
Create a second table to keep track of which partitions actually have posts in them. For example, if you were to stick with posts by hour, you could create a table like this:
CREATE TABLE post_hours (
postYear int,
postMonth int,
postDay int,
postHour int,
PRIMARY KEY (postYear, postMonth, postDay, postHour)
);
You'd then insert into this table (using a Batch) anytime a user adds a new post. You can then query this table first before you query the Posts
table to figure out which partitions have posts and should be queried (and thus avoid querying a whole bunch of empty partitions).
Upvotes: 3