Reputation: 3282
I am in need of the best approach for the following use case,
I have 'Device' table (Only one Partition Id : 'Device') and I have another table 'DeviceStatistics' (Partition Id : 'deviceId' so that this table will have as many partition as number of devices) which means for every device there will be a statistics collected for every minute.
CREATE TABLE device(
"partitionId" text,"name" text,"deviceId" text, ..., primary key ("partitionId","name","deviceId"));
where partitionId - it is a constant ('device')
CREATE TABLE deviceStatistics (
"deviceId" text,
"timestamp" timestamp, ...,
primary key ("deviceId","timestamp")) with clustering order by ("timestamp" DESC);
where 'deviceId' - it is the partition key and under each partition will have the list of timestamp entries
Till this it is fine, Because I just need the following queries,
1) select * from device where partitionId = 'device'
- which list all the devices available.
2) select * from deviceStatistics where deviceId = 'deviceId_1'
- which list all the device statistics for a deviceId
3) select * from deviceStatistics where deviceId = 'deviceId_1' LIMIT 1
- which gets the most recent statistics for a deviceId
Now I need the solution for the following use case,
I need to collect the cluster level statistics which means I need to collect all the device statistics for the timestamp,
(i.e) If the deviceStatistics for 4 devices are available for the timestamp then I need to collect all the four statistics for a timestamp and add in device group level.
which means my DeviceGroupstatistics is the aggregation of all the device statistics for the timestamp.
Now the problem is, Since I have 'deviceId' as the partitionId for the deviceStatistics table, I need to perform this query (select * from deviceStatistics where deviceId = 'deviceId' LIMIT 1) for all the deviceIds. So Lets say I have 1000 devices, then I need to trigger this query for all the 1000 devices for every minute.
Is there a better design for this?
Upvotes: 1
Views: 66
Reputation: 1316
Alex Ott's proposal is good practice: duplicate your data in another table with a timestamped bucket (day, hour, minute, second, depending on input velocity) as partition key, and deviceid as first clustering column (depending on your query).
something like
PRIMARY KEY (bucket, device_id, timestamp ... etc)
Choosing the right bucket size is important : according to several posts, a partition in cassandra should not go too much above 100MB or so.
If you collect your statistics every minute, a day bucket for 1000 devices and 100 bytes record of data would result in a 1440 (24x60) x nbr of devices (1000) x size of record (100) partition size => 144,000,000 bytes per partition whichs sounds ok, but you have to do the estimation and measures with your data, this is a rough calculation.
If you have to query for sevaral days, you will have to add an IN clause to your query, with a limited number of terms (10 is considered lot), or do several queries, but they will be fast ... :)
bests,
Alain
Upvotes: 1
Reputation: 87279
I would recommend to have separate table where timestamp will be the partition key, and device ID is clustering key. Granularity of timestamp may depend on your application - for example, drop seconds & round to minutes, or something like.
You can implement storing the data from your application (preferred), or you can use materialized views (but they are experimental, and not always recommended to use).
Upvotes: 1