Reputation: 546
I'm trying to understand the best practices around storing aggregated time series based data.
For instance if I am building a weather service application that's ingesting lots of weather metrics from sensors around the world and storing that weather data in the form of the weather for today, the week, for the month, what's a good way to model that?
Would the day level, week level, and month level each have their own column family?
Then there's the factor of location. Each location would have it's own weather data, so would partitioning by say some zipcode or geohash for a specific area make sense?
The access patterns would be querying for the daily or weekly or monthly weather in a city.
Upvotes: 0
Views: 251
Reputation: 649
@dipen, you could also refer to this documentation where it walks developers through various data models by their use case. @AlexOtt has great questions to begin with the data models for your use case and @aaron has a great example demonstration.
Here is an example. You could very much customize it for your weather use case. For a given access pattern requirement like in the below example,
we would go ahead and design a Cassandra table as follows to answer them,
Upvotes: 0
Reputation: 57798
let's say ever 5 minutes. Would that have an impact on the design?
Yes. So sensor updates every 5 minutes happen at 12x per hour or 288x per day.
The access patterns would be querying for the daily or weekly or monthly weather in a city.
That also makes for 2016x per week and 8640x per month (30 days). The reason this is important, is because Cassandra has hard limits of storing 2GB and 2 billion cells per partition. This means that storing time series data by city only, would eventually hit this limit (although things would likely grind to a halt long before that).
But the general idea is that you want to model your tables around:
So if we're just talking about temperatures and maybe a few other data points (precipitation, etc), partitioning by month and city should work just fine.
CREATE TABLE weather_sensor_data (
city TEXT,
month INT,
temp FLOAT,
recorded_time TIMESTAMP,
PRIMARY KEY ((city,month),recorded_time))
WITH CLUSTERING ORDER BY (recorded_time DESC);
Now, I could query for weather sensor data since 8AM, like this:
> SELECT * FROM weather_sensor_data
WHERE city='Minneapolis, MN'
AND month=202111
AND recorded_time > '2021-11-01 08:00';
city | month | recorded_time | temp
-----------------+--------+---------------------------------+------
Minneapolis, MN | 202111 | 2021-11-01 08:35:00.000000+0000 | 3
Minneapolis, MN | 202111 | 2021-11-01 08:30:00.000000+0000 | 3
Minneapolis, MN | 202111 | 2021-11-01 08:25:00.000000+0000 | 2
Minneapolis, MN | 202111 | 2021-11-01 08:20:00.000000+0000 | 2
Minneapolis, MN | 202111 | 2021-11-01 08:15:00.000000+0000 | 2
(5 rows)
This should help you get started.
Upvotes: 0