Beginner

Reputation: 349

MAX() function in Cassandra Query Language returns wrong values

I am running the Cassandra queries below. I get the correct maximum value for max_sale_value, but the wrong value for the corresponding country column.

I am using Cassandra 5.0.2 with the following configuration:

[cqlsh 6.2.0 | Cassandra 5.0.2 | CQL spec 3.4.7 | Native protocol v5]

Please correct me if there is any issue with the queries; I am including the DDL as well.

CREATE TABLE sales_data (
    country TEXT,
    record_no UUID,
    invoice TEXT,
    stock_code TEXT,
    description TEXT,
    quantity INT,
    invoice_date TEXT,
    unit_price DECIMAL,
    customer_id TEXT,
    PRIMARY KEY (country, record_no)
);


SELECT MAX(quantity * unit_price) AS max_sale_value, country
FROM sales_data GROUP BY country; -- gives max_sale_value for every country, but not the single row with the overall maximum, i.e. United Kingdom | 25111.090

SELECT country, MAX(quantity * unit_price) AS max_sale_value
FROM sales_data ALLOW FILTERING; -- gives max_sale_value correctly as 25111.090, but the wrong country name

Upvotes: 1

Views: 27

Answers (1)

Madhavan

Reputation: 649

Thanks for the question.

TL;DR Ideally these computations would be performed on the application side, but Cassandra® does offer some features that can help to an extent. The reason you see the wrong country is that when you SELECT a non-aggregated column alongside MAX(), Cassandra returns that column's value from the first row it reads, not from the row that contains the maximum.
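To show what "application side" means concretely, here is a minimal Python sketch that finds both the per-country maxima and the single row with the overall maximum, keeping the matching country. The sample rows and values are made up for illustration; in practice they would come from a driver query such as SELECT country, quantity, unit_price FROM sales_data.

```python
from decimal import Decimal

# Rows as they might be fetched from Cassandra (e.g. via the DataStax
# Python driver). These sample values are made up for illustration.
rows = [
    {"country": "United Kingdom", "quantity": 10, "unit_price": Decimal("2511.109")},
    {"country": "United Kingdom", "quantity": 1, "unit_price": Decimal("5.00")},
    {"country": "France", "quantity": 3, "unit_price": Decimal("7.50")},
]

# Maximum sale value per country.
max_per_country = {}
for row in rows:
    sale = row["quantity"] * row["unit_price"]
    country = row["country"]
    if country not in max_per_country or sale > max_per_country[country]:
        max_per_country[country] = sale

# The single row with the overall maximum, keeping the matching country.
best = max(rows, key=lambda r: r["quantity"] * r["unit_price"])
print(best["country"], best["quantity"] * best["unit_price"])
# prints: United Kingdom 25111.090
```

Because the row itself is selected (rather than just the aggregate), the country is guaranteed to be the one that actually produced the maximum.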

Consider a data model like the one below (I've removed all the non-essential columns for brevity):

CREATE TABLE default_keyspace.so_79330671 (
    country text,
    quantity int,
    unit_price int,
    <...other columns...>
    PRIMARY KEY (country, quantity, unit_price)
) WITH CLUSTERING ORDER BY (quantity DESC, unit_price DESC)
<... other table properties ...>
;

NOTE: We're adding both the quantity and unit_price columns as clustering keys for this table and instructing the database to store them in DESC order, to our benefit during the query operation; I'll explain this in a bit. Disclaimer: Choosing country as the partition key is a bad design, as it will come back to bite us when a partition grows significantly in size (typically over 1MB), causing hot spots and unmanageable partition sizes. But that's not the core of this discussion, so I'm leaving it out here.

and we insert a few rows:

token@cqlsh:default_keyspace> insert into so_79330671 (country,quantity,unit_price) values ('US',1,2);
token@cqlsh:default_keyspace> insert into so_79330671 (country,quantity,unit_price) values ('US',2,3);
token@cqlsh:default_keyspace> insert into so_79330671 (country,quantity,unit_price) values ('UK',1,1);
token@cqlsh:default_keyspace> insert into so_79330671 (country,quantity,unit_price) values ('UK',2,2);

We could then perform a query as below:

token@cqlsh:default_keyspace> select country, unit_price * quantity from so_79330671;

 country | unit_price * quantity
---------+-----------------------
      US |                     6
      US |                     2
      UK |                     4
      UK |                     1

(4 rows)

which fetches the unit_price multiplied by quantity for every row in the table. Notice that the rows within each partition come back largest-first, since we've instructed C* to store the values in descending order, having modelled the table according to my query (read) pattern.

OK, now what if I want the maximum sale per partition key, i.e. per country? I'll then leverage PER PARTITION LIMIT here to my benefit, without using any aggregation operation:

token@cqlsh:default_keyspace> select country, unit_price * quantity from so_79330671 per partition limit 1;

 country | unit_price * quantity
---------+-----------------------
      US |                     6
      UK |                     4

(2 rows)

OK, what if I want the maximum sale across ALL partition keys, i.e. across all countries? Then I'd do it as below:

token@cqlsh:default_keyspace> select country, max(unit_price * quantity) from so_79330671;

 country | system.max(unit_price * quantity)
---------+-----------------------------------
      US |                                 6

(1 rows)

Warnings :
Aggregation query used without partition key

NOTE: This is going to scan the entire table while performing the aggregation. That's not an efficient query and is a clear anti-pattern, but it's possible. You should only perform aggregation at the partition key level and do the final filtering at your application layer.
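That final filtering step is cheap: the PER PARTITION LIMIT 1 query above already returns one row per country, so the client only has to reduce that small result set. A minimal Python sketch, with the rows hard-coded to mirror the cqlsh output:

```python
# Per-partition maxima as returned by:
#   SELECT country, unit_price * quantity FROM so_79330671 PER PARTITION LIMIT 1;
# Hard-coded here to mirror the cqlsh output; in practice these rows
# would come back from the driver.
per_partition_max = [
    ("US", 6),
    ("UK", 4),
]

# Pick the overall maximum across all countries on the client side.
country, max_sale = max(per_partition_max, key=lambda row: row[1])
print(country, max_sale)  # prints: US 6
```

This keeps the heavy lifting inside the partitions (where the clustering order does the work) and leaves only a tiny reduction for the application.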

Resources for further reading:

  • There is an excellent FREE browser-based tutorial here that walks us through how to design a data model in Cassandra®.
  • It is important that you design your tables based on the query pattern and keep partition sizes small (typically I'd say under 1MB) for future scale and efficiency. Don't use ALLOW FILTERING in your SELECT queries, except in very niche cases, such as filtering on the partition key alone.

I hope that helps!

P.S.: I work at DataStax as of this writing, and I've used DataStax Astra DB, a fully-managed SaaS, to demonstrate these examples.

Upvotes: 2
