amankapur91
amankapur91

Reputation: 484

CQL data model to bypass secondary index issuee

I have model which looks like

StateChange: 
    row_id
    group_name
    timestamp
    user_id

I aim to query as follows:

Query 1 = Find all state changes with row_id = X ORDER BY Timestamp DESC Query 2 = Find all state changes with row_id = X and group_name = Y ORDER BY Timestamp DESC

Using my limited CQL knowledge, the only way to do so was to create 2 query tables one for each query mentioned above

For query 1:

CREATE TABLE state_change (
    row_id int,
    user_id int,
    group_name text,
    timestamp timestamp,
    PRIMARY KEY (row_id, timestamp)
)

For query 2:

CREATE TABLE state_change_by_group_name (
    row_id int,
    user_id int,
    group_name text,
    timestamp timestamp,
    PRIMARY KEY ((row_id, group_name), timestamp)
)

This does solve the problem but I have duplicated data in Cassandra now.

Note: Creating an group_name index on table works but I cannot ORDER BY timestamp anymore as its is the secondary index now.

Looking for a solution which requires only one table.

Upvotes: 1

Views: 102

Answers (2)

Aaron
Aaron

Reputation: 57748

Carlo is correct in that your multiple table solution is the proper approach here.

This does solve the problem but I have duplicated data in Cassandra now.

... Looking for a solution which requires only one table.

Planet Cassandra recently posted an article on this topic: Escaping From Disco-Era Data Modeling

(Full disclosure: I am the author)

But two of the last paragraphs really address your point (especially, the last sentence):

That is a very 1970′s way of thinking. Relational database theory originated at a time when disk space was expensive. In 1975, some vendors were selling disk space at a staggering eleven thousand dollars per megabyte (depending on the vendor and model). Even in 1980, if you wanted to buy a gigabyte’s worth of storage space, you could still expect to spend around a million dollars. Today (2014), you can buy a terabyte drive for sixty bucks. Disk space is cheap; operation time is the expensive part. And overuse of secondary indexes will increase your operation time.

Therefore, in Cassandra, you should take a query-based modeling approach. Essentially, (Patel, 2014) model your column families according to how it makes sense to query your data. This is a departure from relational data modeling, where tables are built according to how it makes sense to store the data. Often, query-based modeling results in storage of redundant data (and sometimes data that is not dependent on its primary row key)…and that’s ok.

Upvotes: 1

Carlo Bertuccini
Carlo Bertuccini

Reputation: 20021

The solution you're looking for does not exists. Two different queries requires two different tables (or at least a secondary index which creates a table under the hood). Denormalization is the norm in Cassandra so you should not think at data duplication as an anti-pattern -- indeed it's the suggested pattern

Upvotes: 2

Related Questions