sangupta
sangupta

Reputation: 2406

Cassandra data-modelling around email-system

I need data modelling help as I haven't found a resource that tackles the same problem.

The user case is similar to an email-system. I want to store a timeline of all emails a user has received and then fetch them back with three different ways:

  1. All emails ever received
  2. Mails that have been read by a user
  3. Mails that are still unread by a user

My current model is as under:

CREATE TABLE TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    read boolean,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

CREATE INDEX ON TIMELINE (userID, read);

The queries I need to support are:

SELECT * FROM TIMELINE where userID = 12;
SELECT * FROM TIMELINE where userID = 12 order by timestamp asc;
SELECT * FROM TIMELINE where userID = 12 and read = true;
SELECT * FROM TIMELINE where userID = 12 and read = false;
SELECT * FROM TIMELINE where userID = 12 and read = true order by timestamp asc;
SELECT * FROM TIMELINE where userID = 12 and read = false order by timestamp asc;

My queries are:

  1. Should I keep read as my secondary index as It will be frequently updated and can create tombstones - per http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html its a problem.

  2. Can we do inequality check on secondary index because i found out that atleast one equality condition should be present on secondary index

  3. If this is not the right way to model, please suggest on how to support the above queries. Maintaining three different tables worries me about the number of insertions (for read/unread) as number of users * emails viewed per day will be huge.

Upvotes: 3

Views: 356

Answers (1)

Jeff Jirsa
Jeff Jirsa

Reputation: 4426

Your index (userID) is high cardinality - you'd probably want to manage that as a second (or third) CF that you manually sync with the application.

Perhaps something like

CREATE TABLE READ_TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

CREATE TABLE UNREAD_TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

That gives you the ability to satisfy queries like:

SELECT * FROM READ_TIMELINE where userID = 12;
SELECT * FROM UNREAD_TIMELINE where userID = 12;
SELECT * FROM READ_TIMELINE where userID = 12 order by timestamp asc;
SELECT * FROM UNREAD_TIMELINE where userID = 12 order by timestamp asc;

That is, you use the natural clustering order for the ORDER BY, and you can move emails from UNREAD to READ with a simple batch (one DELETE, one INSERT)

Now, you'll end up with potentially lots of tombstones in the UNREAD table, as you mark emails read. Setting GCGS low and using frequent compaction can help that somewhat, but you may also want to break those partitions up to avoid having tombstoneoverwhelming issues if you have thousands of emails fly in, get marked read.

Upvotes: 2

Related Questions