CQL data model to bypass secondary index issuee

Question

I have model which looks like

StateChange: 
    row_id
    group_name
    timestamp
    user_id

I aim to query as follows:

Query 1 = Find all state changes with row_id = X ORDER BY Timestamp DESC Query 2 = Find all state changes with row_id = X and group_name = Y ORDER BY Timestamp DESC

Using my limited CQL knowledge, the only way to do so was to create 2 query tables one for each query mentioned above

For query 1:

CREATE TABLE state_change (
    row_id int,
    user_id int,
    group_name text,
    timestamp timestamp,
    PRIMARY KEY (row_id, timestamp)
)

For query 2:

CREATE TABLE state_change_by_group_name (
    row_id int,
    user_id int,
    group_name text,
    timestamp timestamp,
    PRIMARY KEY ((row_id, group_name), timestamp)
)

This does solve the problem but I have duplicated data in Cassandra now.

Note: Creating an group_name index on table works but I cannot ORDER BY timestamp anymore as its is the secondary index now.

Looking for a solution which requires only one table.

Aaron · Accepted Answer

Carlo is correct in that your multiple table solution is the proper approach here.

This does solve the problem but I have duplicated data in Cassandra now.

... Looking for a solution which requires only one table.

Planet Cassandra recently posted an article on this topic: Escaping From Disco-Era Data Modeling

(Full disclosure: I am the author)

But two of the last paragraphs really address your point (especially, the last sentence):

That is a very 1970′s way of thinking. Relational database theory originated at a time when disk space was expensive. In 1975, some vendors were selling disk space at a staggering eleven thousand dollars per megabyte (depending on the vendor and model). Even in 1980, if you wanted to buy a gigabyte’s worth of storage space, you could still expect to spend around a million dollars. Today (2014), you can buy a terabyte drive for sixty bucks. Disk space is cheap; operation time is the expensive part. And overuse of secondary indexes will increase your operation time.

Therefore, in Cassandra, you should take a query-based modeling approach. Essentially, (Patel, 2014) model your column families according to how it makes sense to query your data. This is a departure from relational data modeling, where tables are built according to how it makes sense to store the data. Often, query-based modeling results in storage of redundant data (and sometimes data that is not dependent on its primary row key)…and that’s ok.

CQL data model to bypass secondary index issuee

Answers (2)

Related Questions