Reputation: 1087
I have a table of events that are done every minute. I want to be able to filter these events by time period and also aggregate data for hour/day/etc.
My data model:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
)
CREATE MATERIALIZED VIEW hour_dev_data AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time)
my query is
select hour, sum(value)
from hour_dev_data
where device = 'tst' and event_time < 149000000 group by device, hour;
fails with error code=2200 [Invalid query] message="PRIMARY KEY column "event_time" cannot be restricted as preceding column "hour" is not restricted"
The only way to make it work is to add ALLOW FILTERING, which is unpredictable.
How can I change my data model to address my query and avoid ALLOW FILTERING mode?
Upvotes: 0
Views: 868
Reputation: 14559
I think you had things pretty much right already, but you need to change your filter on event_time
to be a filter on hour
.
select hour, sum(value)
from hour_dev_data
where device = 'tst' and hour < 1500000000
group by device, hour;
When you were filtering on event_time, you were implicitly requiring a full scan of the row, as the event_time
is clustered after the hour. To filter by the event_time
, every cell would need to be examined to check the event_time
. When you filter by hour
, it is first in the clustering key, so it can be efficiently scanned and filtered. See this post post on ALLOW FILTERING for more on this.
I agree with xmas79 that you probably want to be bucketing at some level, perhaps by month or year depending on your frequency of events. If you're always going to be looking for the most recent values, then setting the clustering key ordering to desc is probably going to be helpful too:
CREATE MATERIALIZED VIEW hour_dev_data3 AS
SELECT device, event_time, hour, value
FROM min_dev_data
WHERE hour IS NOT NULL AND value IS NOT NULL
and event_time IS NOT NULL AND device IS NOT NULL
PRIMARY KEY ((device), hour, event_time)
WITH CLUSTERING ORDER BY (hour DESC);
Scheduling aggregations like xmas79 suggests is going to be more efficient as the sum is done once, rather than summing every time reads are done, however it does add more maintenance burden, where the materialised view handles it for you.
Upvotes: 1
Reputation: 5180
You have to proactively produce these results:
create table min_dev_data (
device TEXT,
event_time BIGINT,
hour BIGINT,
day BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table hour_dev_data (
device TEXT,
hour BIGINT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
create table day_dev_data (
device TEXT,
day BIGINT,
event_time BIGINT,
value DOUBLE,
PRIMARY KEY ((device), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
Each table will satisfy ONE granularity only.
Every hour you query the minute data for the latest hour data for each device with something like:
SELECT * FROM min_dev_data WHERE device = X AND event_time < YYYY
Sum that at application level and store this value into the hour table:
INSERT INTO hour_dev_data (device, hour, day, event_time, value) VALUES (....);
And every day you query the hour table to produce the further aggregate data:
SELECT * FROM hour_dev_data WHERE device = X AND event_time < YYYY
sum at application level and store this value into the day table.
Please consider adding some form of bucketing because, at one minute interval, in two months your minute table will have wide partitions. This should not be a problem if you keep the table in reverse order (like I did) and query only for the last couple of hours. But if you want to query back in time as well then you must definitely use bucketing in your tables.
Upvotes: 1