Reputation: 780
I need to store the latest updates that need to be pushed to users' newsfeed pages in a Cassandra table for later retrieval, and my table's schema is as follows:
CREATE TABLE newsfeed (
    user_name text,
    post_id bigint,
    post_type text,
    favorited boolean,
    shared boolean,
    own boolean,
    date timestamp,
    PRIMARY KEY (user_name, date, post_id, post_type)
);
The first three columns (user_name, post_id, and post_type) in combination build the actual primary key of the table. However, since I want to ORDER the results of SELECT queries on this table by each row's "date", I placed the date column into the primary key fields as the "second" entry (did I have to do this?).
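For example, the kind of SELECT I want to run looks something like this (the user name is just an example):

SELECT * FROM newsfeed
WHERE user_name='pooria'
ORDER BY date DESC;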
When I want to delete a row by giving only "user_name", "post_id", and "post_type", as follows:
DELETE FROM newsfeed WHERE user_name='pooria' and post_id=36 and post_type='p';
I will get the following error:
Bad Request: Missing PRIMARY KEY part date since post_id is set
I need the date column to be part of the primary key since I want to use it in my ORDER BY clauses, but on the other hand I have to delete some rows without knowing their "date" values!
So how are such problems tackled in Cassandra? Should I be fixing my data model and using a different schema for this job?
Upvotes: 2
Views: 5440
Reputation: 57843
DataStax's Chief Evangelist Patrick McFadden posted an article demonstrating a few time series modeling patterns. Definitely makes for a good read, and should be of some help to you: Getting Started with Time Series Data Modeling.
I think your table is just fine. However, with the way that composite primary keys work in Cassandra, you cannot skip primary key components in a query. So if you do end up needing to query data by user_name, post_id, and/or post_type differently (without date), you should create a table specifically for that query (one which does not include date in the primary key).
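Such a query table could look something like this (the name newsfeed_by_post is just my own invention for illustration):

CREATE TABLE newsfeed_by_post (
    user_name text,
    post_id bigint,
    post_type text,
    favorited boolean,
    shared boolean,
    own boolean,
    date timestamp,
    PRIMARY KEY (user_name, post_id, post_type)
);

-- your original DELETE now works against this table,
-- because every primary key component is specified:
DELETE FROM newsfeed_by_post
WHERE user_name='pooria' AND post_id=36 AND post_type='p';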
I will, however, say that in general, creating a table that will handle regular delete operations is not a good idea. In fact, I'm pretty sure that has been classified as a Cassandra "anti-pattern." Data really isn't deleted from Cassandra; it is tombstoned. Tombstones are reconciled at compaction time (assuming that the tombstone threshold time has been met), and having too many of them has been known to cause performance issues.
If you read the article I linked above, go down to the section named "Time Series Pattern 3." You will notice that the INSERT statements are run with the USING TTL clause. This gives the data a time-to-live in seconds, after which it will "quietly disappear." For instance, if you wanted to keep your data around for 24 hours (86400 seconds), you could do something like this:
INSERT INTO newsfeed (...) VALUES (...) USING TTL 86400
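Filling in the columns from your schema, a complete statement could look something like this (the values are made up for illustration):

-- the row expires 24 hours after the write
INSERT INTO newsfeed (user_name, post_id, post_type, favorited, shared, own, date)
VALUES ('pooria', 36, 'p', false, false, true, '2014-08-01 12:00:00')
USING TTL 86400;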
Using the TTL feature is a preferable alternative to regular cleansing by DELETE.
Upvotes: 4