news feed like time-series data on cassandra

Question

I am making a website and I want to store all users posts in one table ordered by the time they post it. the cassandra data model that I made is this

CREATE TABLE Posts(
   ID uuid,
   title text,
   insertedTime timestamp,
   postHour int,
   contentURL text,
   userID text,
   PRIMARY KEY (postHour, insertedTime)
) WITH CLUSTERING ORDER BY (insertedTime DESC);

The question I'm facing is, when a user visits the posts page, it fetches the most recent ones by querying

SELECT * FROM Posts WHERE postHour = ?;

? = current hour

so far when the user scrolls down ajax requests are made to get more posts from the server. Javascript keeps track of postHour of the lastFetched item and sends back to the server along with the cassandra PagingState when requesting for new posts.

but this approach will query more than 1 partition when user scrolls down. I want to know whether this model would perform without a problem, is there any other model that I can follow.

Someone please point me in the right direction. Thank You.

Luke Tillman · Accepted Answer

That's a good start but a few pointers:

You'll probably need more than just the postHour as the partition key. I'm guessing you don't want to store all the posts regardless of the day together and then page through them. What you're probably are after here is:
```
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime)
```
But there's still a problem. Your PRIMARY KEY has to uniquely identify a row (in this case a post). I'm going to guess it's possible, although not likely, that two users might make a post with the same insertedTime value. What you really need then is to add the ID to make sure they are unique:
```
PRIMARY KEY ((postYear, postMonth, postDay, postHour), insertedTime, ID)
```
At this point, I'd consider just combining your ID and insertedTime columns into a single ID column of type timeuuid. With those changes, your final table looks like:
```
CREATE TABLE Posts(
  ID timeuuid,
  postYear int,
  postMonth int,
  postDay int,
  postHour int,
  title text,
  contentURL text,
  userID text,
  PRIMARY KEY ((postYear, postMonth, postDay, postHour), ID)
) WITH CLUSTERING ORDER BY (ID DESC);
```
Whatever programming language you're using should have a way to generate a timeuuid from the inserted time and then extract that time from a timeuuid value if you want to show it in the UI or something. (Or you could use the CQL timeuuid functions for doing the converting.)

As to your question about querying multiple partitions, yes, that's totally fine to do, but you could run into trouble if you're not careful. For example, what happens if there is a 48 hour period with no posts? Do you have to issue 48 queries that return empty results before finally getting some back on your 49th query? (That's probably going to be really slow and a crappy user experience.)

There are a couple things you could do to try and mitigate that:

Make your partitions less granular. For example, instead of doing posts by hour, make it posts by day, or posts by month. If you know that those partitions won't get too large (i.e. users won't make so many posts that the partition gets huge), that's probably the easiest solution.
Create a second table to keep track of which partitions actually have posts in them. For example, if you were to stick with posts by hour, you could create a table like this:
```
CREATE TABLE post_hours (
  postYear int,
  postMonth int,
  postDay int,
  postHour int,
  PRIMARY KEY (postYear, postMonth, postDay, postHour)
);
```
You'd then insert into this table (using a Batch) anytime a user adds a new post. You can then query this table first before you query the Posts table to figure out which partitions have posts and should be queried (and thus avoid querying a whole bunch of empty partitions).

news feed like time-series data on cassandra

Answers (1)

Related Questions