Reputation: 502
We are currently investigating Cassandra as the database for a large time series system.
I have read through https://academy.datastax.com/resources/getting-started-time-series-data-modeling about modelling time series data in Cassandra.
What we have is high velocity timeseries data coming in for many weather stations. Each weather station has a number of "sensors" that each collect three metrics: temperature, humidity, and light.
We are trying to store each series as a wide row. However, we expect to get billions of readings per station over the life of the project, so we would like to limit the row size.
We would like there to be a single row for each (weather_station_id, year, day_of_year)
, that is, a new row for every day. However, we still want the partition key to be weather_station_id
- that is, we want all readings for a station to be on the same node.
We currently have the following schema, but I would like to get some feedback.
CREATE TABLE weather_station_data (
weather_station_id int,
year int,
day_of_year int,
time timestamp,
sensor_id int,
temperature int,
humidity int,
light int,
PRIMARY KEY ((weather_station_id), year, day_of_year, time, sensor_id)
) WITH CLUSTERING ORDER BY (year DESC, day_of_year DESC, time DESC, sensor_id DESC);
In the aforementioned document, they make use of this "limit partition row by date" concept. However, it is unclear to me whether or not the date in their examples is part of the partition key.
Upvotes: 2
Views: 982
Reputation: 2321
In my opinion the datastax model isn't really great. The problem with this model:
The better solution: Ask yourself how you will query this data. If you say: I query all data per year, use the year also as partion key. If you need also query data of more than one year, you can create two queries with a different year. This works and the performance is better. (The bottleneck is maybe only the network to your client)
I have one question to your: Can you aggregate your data? Cassandra has a column type called counter. You can create a java/scala application where your aggregate your data while they are produced. You can use a streaming framework for this: Flink or Spark. (If you need a bit more than only counting.). One scenario: You aggregating your data for each hour and day. You got your data in your streaming app. Now: You have an variable for hourly data. You count up or down or whatever. If the hour is finishes, your put this row in your hourly column family and daily column family. In your daily column family your using a counter. I hope, you understand what i mean.
Upvotes: 0
Reputation: 783
According to the tutorial, if we choose to have weather_station_id as the only partition, the row will be exhausted. i.e C* has a practical limitation of 2 billion columns per partition.
So IMO, your data-model is bad.
However, it is unclear to me whether or not the date in their examples is part of the partition key.
The tutorial used
PRIMARY KEY ((weatherstation_id,date),event_time)
So, yes they considered data to be part of partition key.
we want all readings for a station to be on the same node.
I am not sure, why you wan't such a requirement. You can always fetch weather data using multiple queries for more than one year.
select * from weather_station_data where weather_station_id=1234 and year= 2013;
select * from weather_station_data where weather_station_id=1234 and year= 2014;
So consider changing your structure to
PRIMARY KEY ((weather_station_id, year), day_of_year, time, sensor_id)
Hope it helps!
Upvotes: 1