Reputation: 484
I have model which looks like
StateChange:
row_id
group_name
timestamp
user_id
I aim to query as follows:
Query 1 = Find all state changes with row_id = X ORDER BY Timestamp DESC Query 2 = Find all state changes with row_id = X and group_name = Y ORDER BY Timestamp DESC
Using my limited CQL knowledge, the only way to do so was to create 2 query tables one for each query mentioned above
For query 1:
CREATE TABLE state_change (
row_id int,
user_id int,
group_name text,
timestamp timestamp,
PRIMARY KEY (row_id, timestamp)
)
For query 2:
CREATE TABLE state_change_by_group_name (
row_id int,
user_id int,
group_name text,
timestamp timestamp,
PRIMARY KEY ((row_id, group_name), timestamp)
)
This does solve the problem but I have duplicated data in Cassandra now.
Note: Creating an group_name index on table works but I cannot ORDER BY timestamp anymore as its is the secondary index now.
Looking for a solution which requires only one table.
Upvotes: 1
Views: 102
Reputation: 57748
Carlo is correct in that your multiple table solution is the proper approach here.
This does solve the problem but I have duplicated data in Cassandra now.
... Looking for a solution which requires only one table.
Planet Cassandra recently posted an article on this topic: Escaping From Disco-Era Data Modeling
(Full disclosure: I am the author)
But two of the last paragraphs really address your point (especially, the last sentence):
That is a very 1970′s way of thinking. Relational database theory originated at a time when disk space was expensive. In 1975, some vendors were selling disk space at a staggering eleven thousand dollars per megabyte (depending on the vendor and model). Even in 1980, if you wanted to buy a gigabyte’s worth of storage space, you could still expect to spend around a million dollars. Today (2014), you can buy a terabyte drive for sixty bucks. Disk space is cheap; operation time is the expensive part. And overuse of secondary indexes will increase your operation time.
Therefore, in Cassandra, you should take a query-based modeling approach. Essentially, (Patel, 2014) model your column families according to how it makes sense to query your data. This is a departure from relational data modeling, where tables are built according to how it makes sense to store the data. Often, query-based modeling results in storage of redundant data (and sometimes data that is not dependent on its primary row key)…and that’s ok.
Upvotes: 1
Reputation: 20021
The solution you're looking for does not exists. Two different queries requires two different tables (or at least a secondary index which creates a table under the hood). Denormalization is the norm in Cassandra so you should not think at data duplication as an anti-pattern -- indeed it's the suggested pattern
Upvotes: 2