Reputation: 349
I am running the below Cassandra query, for which I am getting the correct maximum value for max_sale_value, but wrong value for the corresponding country column.
I am using Cassandra 5.0.2 with the below configurations,
[cqlsh 6.2.0 | Cassandra 5.0.2 | CQL spec 3.4.7 | Native protocol v5]
Please correct me if there is any issue with the queries, I am adding DDL query as well.
CREATE TABLE sales_data (
country TEXT,
record_no UUID,
invoice TEXT,
stock_code TEXT,
description TEXT,
quantity INT,
invoice_date TEXT,
unit_price DECIMAL,
customer_id TEXT,
PRIMARY KEY (country, record_no)
);
SELECT MAX(quantity * unit_price) AS max_sale_value, country
FROM sales_data group by country; --gives max_sale_value of all the countries, but not 1 row which is the maximum among all the countries i.e. United Kingdom | 25111.090
SELECT Country, MAX(quantity * unit_price) AS max_sale_value
FROM sales_data ALLOW FILTERING; --gives max_sale_value correctly as 25111.090, but giving the wrong country name
Upvotes: 1
Views: 27
Reputation: 649
thanks for the question.
TL;DR Ideally these computations would be performed at the application side, but Cassandra® does offer us with some functions that could help these to an extent.
Consider a data model like below (I've removed all other non-important columns for brevity here):
CREATE TABLE default_keyspace.so_79330671 (
country text,
quantity int,
unit_price int,
<...other columns...>
PRIMARY KEY (country, quantity, unit_price)
) WITH CLUSTERING ORDER BY (quantity DESC, unit_price DESC)
<... other table properties ...>
;
NOTE: we're adding both the quantity
and unit_price
columns as clustering keys for this table and are instructing the database to store them in a DESC
order for our benefit during the query operation. I'll explain this here in a bit. Disclaimer: Chosing country
as the partition key is a bad design as this will come to bite us when this partition grows significantly in size (typically over 1MB
in size) causing hot spots and unmanageable partition size, but that's not the core of this discussion and hence I'm leaving that out here.
and we insert a couple data:
token@cqlsh:default_keyspace> insert into so_79330671 (country,quantity,unit_price) values ('US',1,2);
token@cqlsh:default_keyspace> insert into so_79330671 (country,quantity,unit_price) values ('US',2,3);
token@cqlsh:default_keyspace> insert into so_79330671 (country,quantity,unit_price) values ('UK',1,1);
token@cqlsh:default_keyspace> insert into so_79330671 (country,quantity,unit_price) values ('UK',2,2);
We could then perform a query as below:
token@cqlsh:default_keyspace> select country, unit_price * quantity from so_79330671;
country | unit_price * quantity
---------+-----------------------
US | 6
US | 2
UK | 4
UK | 1
(4 rows)
which fetches us the unit_price
multiplied by quantity
value across all the rows of the table. Since we've instructed C* to store the values in a descending order, because I've modelled the table according to my query (read) pattern.
OK, now what if I want the maximum sales per partition key, i.e. per country. I'll then leverage the concept of PER PARTITON LIMIT
here to my benefit without using any aggregation operation:
token@cqlsh:default_keyspace> select country, unit_price * quantity from so_79330671 per partition limit 1;
country | unit_price * quantity
---------+-----------------------
US | 6
UK | 4
(2 rows)
OK, what if I want the maximum sales across ALL partition keys, i.e. across all countries, then I will do it as below:
token@cqlsh:default_keyspace> select country, max(unit_price * quantity) from so_79330671;
country | system.max(unit_price * quantity)
---------+-----------------------------------
US | 6
(1 rows)
Warnings :
Aggregation query used without partition key
NOTE: This is going to do a scan of this entire table performing aggregation operation and that's not an efficient query and is a clear anti-pattern, but it's possible. You should only perform this at the partition key level and filter it out at your application layer.
Resources for further reading:
ALLOW FILTERING
in your select queries, unless you do it on very niche use cases like just on the partition alone.I hope that helps!
p/s: I work at DataStax as of this writing and I've used DataStax Astra DB, a fully-managed SaaS to demonstrate these examples.
Upvotes: 2