Giovanni Montenegro

Reputation: 17

How to model data in Cassandra to avoid useless duplication

I'm setting up a new service that manages payment transactions. I have four use cases:

  1. Read the last 10 transactions.
  2. Read the last transaction of the day.
  3. Read the last 10 transactions for a given card number.
  4. Read the last 10 transactions made with a specific device (e.g., the device id is "device01").

To reach this goal I created three tables:

CREATE TABLE test.transaction_by_device (
    uid uuid,
    device text,
    time timestamp,
    amount double,
    currency text,
    cardNumber text,
    PRIMARY KEY ((device), time)
) WITH CLUSTERING ORDER BY (time DESC);


CREATE TABLE test.transaction_by_cardNumber (
    uid uuid,
    device text,
    time timestamp,
    amount double,
    currency text,
    cardNumber text,
    PRIMARY KEY ((cardNumber), time)
) WITH CLUSTERING ORDER BY (time DESC);

CREATE TABLE test.transaction_by_time (
    uid uuid,
    device text,
    year text,
    month text,
    day text,
    time timestamp,
    amount double,
    currency text,
    cardNumber text,
    PRIMARY KEY ((year, month, day), time)
) WITH CLUSTERING ORDER BY (time DESC);
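
Since the same transaction lands in all three tables, each write has to target all of them. Here is a minimal sketch of keeping the copies in sync with a logged batch; the literal values are made up for illustration:

-- Sketch: one logged batch writes the same transaction to all three tables.
-- Cassandra guarantees a logged batch is eventually applied as a whole.
BEGIN BATCH
    INSERT INTO test.transaction_by_device (uid, device, time, amount, currency, cardNumber)
    VALUES (f47ac10b-58cc-4372-a567-0e02b2c3d479, 'device01', '2019-02-22 10:15:30+0000', 19.99, 'EUR', '4242800217402773');
    INSERT INTO test.transaction_by_cardNumber (uid, device, time, amount, currency, cardNumber)
    VALUES (f47ac10b-58cc-4372-a567-0e02b2c3d479, 'device01', '2019-02-22 10:15:30+0000', 19.99, 'EUR', '4242800217402773');
    INSERT INTO test.transaction_by_time (uid, device, year, month, day, time, amount, currency, cardNumber)
    VALUES (f47ac10b-58cc-4372-a567-0e02b2c3d479, 'device01', '2019', '2', '22', '2019-02-22 10:15:30+0000', 19.99, 'EUR', '4242800217402773');
APPLY BATCH;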

The queries that I used are:

const select_last_10_transactions_by_time =
    'SELECT * FROM test.transaction_by_time LIMIT 10';
const select_last_10_transactions_of_the_day_by_time =
    "SELECT * FROM test.transaction_by_time WHERE year='2019' AND month='2' AND day='22' LIMIT 10";
const select_last_10_transactions_by_cardNumber =
    "SELECT * FROM test.transaction_by_cardNumber WHERE cardNumber='4242800217402773' LIMIT 10";
const select_last_10_transactions_by_device =
    "SELECT * FROM test.transaction_by_device WHERE device='device01' LIMIT 10";

Everything works very well, and the select time over 1 million rows is about 0.5 s.

The problem is that all the data is duplicated in every table. What am I doing wrong with the data modeling in these use cases?

Is there a better way to optimize these queries/tables?

I read something about materialized views, but I also read that they aren't ready for production.
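
For reference, a materialized view replacing the by-device table would look something like the sketch below (the view name is illustrative); every column of the view's primary key must be restricted with IS NOT NULL, and the view key must contain all of the base table's key columns plus at most one other:

-- Sketch: let Cassandra maintain the by-device copy from transaction_by_time.
CREATE MATERIALIZED VIEW test.transaction_by_device_mv AS
    SELECT * FROM test.transaction_by_time
    WHERE device IS NOT NULL AND time IS NOT NULL
      AND year IS NOT NULL AND month IS NOT NULL AND day IS NOT NULL
    PRIMARY KEY ((device), time, year, month, day)
    WITH CLUSTERING ORDER BY (time DESC);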


Thanks.

Upvotes: 1

Views: 80

Answers (1)

Chris Lohfink

Reputation: 16400

The problem is that all the data is duplicated in every table. What am I doing wrong with the data modeling in these use cases?

You are not doing it wrong; you are doing it right. Disk space is usually far cheaper than trying to do distributed joins, especially with tiny data like this. If anything is wrong, I would worry about the partition for a single card or device growing too large, since it is essentially unbounded. What if a customer sets something up that makes a transaction from a device every second? After a month or a year, that partition will get difficult to work with.
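
If that becomes a concern, one common mitigation is to add a time bucket to the partition key so no single device's partition grows without bound. A minimal sketch, assuming a month-sized bucket (the month_bucket column and its format are illustrative, not from the answer):

-- Sketch: bucket each device's partition by month, e.g. '2019-02',
-- computed client-side at write time.
CREATE TABLE test.transaction_by_device (
    uid uuid,
    device text,
    month_bucket text,
    time timestamp,
    amount double,
    currency text,
    cardNumber text,
    PRIMARY KEY ((device, month_bucket), time)
) WITH CLUSTERING ORDER BY (time DESC);

-- Reads target the current bucket first, stepping back one bucket
-- at a time if fewer than 10 rows come back:
SELECT * FROM test.transaction_by_device
WHERE device = 'device01' AND month_bucket = '2019-02'
LIMIT 10;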

Upvotes: 1
