Reputation: 23
relational databases and Cassandra. With two tables like the following:
TABLE 1: PRIMARY KEY (ID, DATE)
ID | DATE | TRIP_TIME |
---|---|---|
B03291 | 2022-01-01 | 5 |
B03291 | 2022-01-02 | 6 |
ZR7875 | 2022-01-01 | 2 |
ZR7875 | 2022-01-02 | 0 |
TABLE 2: PRIMARY KEY ((ID, TYPE), DATE)
TYPE | ID | DATE | TRIP_TIME |
---|---|---|---|
A | B03291 | 2022-01-01 | 5 |
A | B03291 | 2022-01-02 | 6 |
B | ZR7875 | 2022-01-01 | 2 |
B | ZR7875 | 2022-01-02 | 0 |
A | GF4589 | 2022-01-01 | 7 |
The two tables hold the same data, but organized in different ways.
Using whichever table better suits this query, I need to get the COUNT of all the IDs that have a trip_time greater than 0 on DATE = '2022-01-01', but I can't use ALLOW FILTERING or create another table.
I have been using the query:
SELECT COUNT(ID)
FROM table1
WHERE date = '2022-01-01'
AND trip_time > 0;
But it raises an error and asks me to allow filtering. If I can't specify an ID, because I want the COUNT across all of them, is there any way to do this?
Thank you for your help and sorry if it is too obvious.
Upvotes: 0
Views: 261
Reputation: 16293
You are getting this error because your query doesn't restrict the partition key:
InvalidRequest: Error from server: code=2200 [Invalid query] \
message="Cannot execute this query as it might involve data filtering and thus may have \
unpredictable performance. If you want to execute this query despite the performance \
unpredictability, use ALLOW FILTERING"
In both tables the trip date is only a clustering column and the trip time is not part of the primary key at all, so neither can be queried on its own without also restricting the partition key.
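To make the key layout concrete, here is a sketch of what Table 1 from the question might look like. The column types and the table name are assumptions, since the question only shows the PRIMARY KEY clause.

```cql
-- Hypothetical schema for Table 1; column types are assumed.
-- Run inside a keyspace (USE <keyspace>;) or qualify the table name.
CREATE TABLE table1 (
    id text,        -- partition key: decides which node holds the row
    date date,      -- clustering column: orders rows inside a partition
    trip_time int,  -- regular column: not part of the primary key
    PRIMARY KEY (id, date)
);

-- Accepted without ALLOW FILTERING: the partition key is restricted,
-- so Cassandra reads exactly one partition.
SELECT trip_time
FROM table1
WHERE id = 'B03291'
  AND date = '2022-01-01';
```

The query in the question leaves out `id`, so Cassandra has no partition to go to and would have to look at all of them.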
The ALLOW FILTERING clause enables filtering on non-primary key columns by performing a full table scan, querying every single partition on all nodes, so it is very expensive and unpredictable.
The ALLOW FILTERING clause is only recommended for use when the query is restricted to a single partition.
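As a rough illustration of the single-partition case, a query like the following keeps the filtering inside one partition. It assumes the question's `table1` with `id` as the partition key; the literal values are just the sample rows from the question.

```cql
-- ALLOW FILTERING confined to a single partition (id = 'B03291'):
-- only that partition's rows are read and then filtered on trip_time.
SELECT COUNT(*)
FROM table1
WHERE id = 'B03291'
  AND date = '2022-01-01'
  AND trip_time > 0
ALLOW FILTERING;
```

This does not answer the question as asked, since it counts for one ID only, but it shows the kind of query where the clause is considered acceptable.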
In order to query against non-primary key columns, you need to index the columns. To illustrate with an example, here's my table which has the trip id as the primary key:
CREATE TABLE stackoverflow.trips_by_id (
    id text PRIMARY KEY,
    tripdate date,
    triptime int
);
If I want to run queries using either tripdate or triptime, I need to index these columns with:
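CREATE CUSTOM INDEX tripdate_idx ON stackoverflow.trips_by_id (tripdate) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX triptime_idx ON stackoverflow.trips_by_id (triptime) USING 'StorageAttachedIndex';
Note that CREATE CUSTOM INDEX needs a USING clause naming the index implementation. Storage-Attached Indexing (SAI) is assumed here, which requires a cluster that supports it.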
Now that I have indexed them, I can execute queries like:
SELECT ... FROM trips_by_id
WHERE tripdate = ?
AND triptime = ?
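Applied to the count from the question, the indexed query might look like the sketch below. It assumes both columns carry storage-attached (SAI) indexes, which accept a range predicate on a numeric column; with other index types the range on `triptime` may be rejected or still require ALLOW FILTERING.

```cql
-- Sketch only: assumes SAI indexes exist on tripdate and triptime.
SELECT COUNT(id)
FROM stackoverflow.trips_by_id
WHERE tripdate = '2022-01-01'
  AND triptime > 0;
```

Even with the indexes in place, this still has to consult many nodes to produce one number, so it is worth testing against realistic data volumes before relying on it.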
WARNING: Indexing has its own trade-offs, so weigh the pros and cons before relying on it. Have a look at "When to use and not use an index" for details.
Cassandra is designed for high-throughput, high-velocity online transaction processing (OLTP) use cases where you are retrieving data one partition at a time (queries filtered by partition key).
In contrast, your query is analytics (OLAP) in nature because you are not reading just one partition -- you are scanning through the whole table. As such, the best way to run analytics queries is to use Apache Spark with the Spark Cassandra connector. Cheers!
Upvotes: 1