How do I count distinct values based on a non-primary key filter in Cassandra?

Question

relational databases and Cassandra. With two tables like the following:

TABLE 1: PRIMARY KEY (ID, DATE)

ID	DATE	TRIP_TIME
B03291	2022-01-01	5
B03291	2022-01-02	6
ZR7875	2022-01-01	2
ZR7875	2022-01-02	0

TABLE 2: PRIMARY KEY ((ID, TYPE), DATE)

TYPE	ID	DATE	TRIP_TIME
A	B03291	2022-01-01	5
A	B03291	2022-01-02	6
B	ZR7875	2022-01-01	2
B	ZR7875	2022-01-02	0
A	GF4589	2022-01-01	7

The two tables hold the same data, organized in different ways.

Using whichever table suits this query better, I need to get the COUNT of all the IDs that have a trip_time greater than 0 on DATE = '2022-01-01', but I can't use ALLOW FILTERING or create another table.

I have been using the query:

SELECT COUNT(ID)
FROM table1
WHERE date = '2022-01-01'
AND trip_time > 0;

But it raises an error and asks me to allow filtering. Since I can't specify an ID, because I want the COUNT for all of them, is there any way to do this?

Thank you for your help and sorry if it is too obvious.

Erick Ramirez · Accepted Answer

Cause

You are getting this error because your query doesn't filter on the partition key:

InvalidRequest: Error from server: code=2200 [Invalid query] \
  message="Cannot execute this query as it might involve data filtering and thus may have \
  unpredictable performance. If you want to execute this query despite the performance \
  unpredictability, use ALLOW FILTERING"

The trip date is only a clustering column, not a partition key column, and the trip time is not part of the primary key at all, so it is not possible to query using these columns alone.

Warning

The ALLOW FILTERING clause enables filtering on non-primary key columns by performing a full table scan, querying every single partition on all nodes, so it is very expensive and its performance is unpredictable.

The ALLOW FILTERING clause is only recommended for use when the query is restricted to a single partition.
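As a minimal sketch of that single-partition case, assuming the asker's table1 with PRIMARY KEY (id, date) and an id value taken from the sample data, the query below fixes the partition key so that only one partition is read before trip_time is filtered:

```cql
-- Restricting the partition key (id) limits the read to one partition;
-- ALLOW FILTERING then only filters trip_time within that partition.
SELECT COUNT(*)
FROM table1
WHERE id = 'B03291'
  AND date = '2022-01-01'
  AND trip_time > 0
ALLOW FILTERING;
```

Note that this counts matching rows for one ID only, so on its own it does not produce the count across all IDs.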

Workaround

In order to query against non-primary key columns, you need to index the columns. To illustrate with an example, here's my table which has the trip id as the primary key:

CREATE TABLE stackoverflow.trips_by_id (
    id text PRIMARY KEY,
    tripdate date,
    triptime int
);

If I want to run queries using either tripdate or triptime, I need to index these columns with:

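-- A CUSTOM index needs a USING clause; StorageAttachedIndex (SAI) is assumed here.
CREATE CUSTOM INDEX tripdate_idx ON stackoverflow.trips_by_id (tripdate) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX triptime_idx ON stackoverflow.trips_by_id (triptime) USING 'StorageAttachedIndex';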

Now that I have indexed them, I can execute queries like:

SELECT ... FROM trips_by_id
  WHERE tripdate = ?
  AND triptime = ?

WARNING: Indexing has its own trade-offs, so weigh the pros and cons before relying on it. Have a look at "When to use and not use an index" for details.

Solution

Cassandra is designed for high-throughput, high-velocity online transaction processing (OLTP) use cases where you retrieve data one partition at a time (queries filtered by partition key).
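For illustration, a read of this kind against the asker's table1 (assuming PRIMARY KEY (id, date), with an id value from the sample data) names the partition key, so it needs neither an index nor ALLOW FILTERING:

```cql
-- id is the partition key and date is the clustering column,
-- so this reads a single row from a single partition.
SELECT id, date, trip_time
FROM table1
WHERE id = 'B03291'
  AND date = '2022-01-01';
```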

In contrast, your query is analytics (OLAP) in nature because you are not reading just one partition -- you are scanning through the whole table. As such, the best way to run analytics queries is to use Apache Spark with the Spark Cassandra connector. Cheers!

