Cassandra Time Series Data Modelling and Limiting Partition Size

We are currently investigating Cassandra as the database for a large time series system.

I have read through https://academy.datastax.com/resources/getting-started-time-series-data-modeling about modelling time series data in Cassandra.

What we have is high velocity timeseries data coming in for many weather stations. Each weather station has a number of "sensors" that each collect three metrics: temperature, humidity, and light.

We are trying to store each series as a wide row. However, we expect to get billions of readings per station over the life of the project, so we would like to limit the row size.

We would like there to be a single row for each (weather_station_id, year, day_of_year), that is, a new row for every day. However, we still want the partition key to be weather_station_id - that is, we want all readings for a station to be on the same node.

We currently have the following schema, but I would like to get some feedback.

CREATE TABLE weather_station_data (
    weather_station_id int,
    year int,
    day_of_year int,
    time timestamp,
    sensor_id int,
    temperature int,
    humidity int,
    light int,
    PRIMARY KEY ((weather_station_id), year, day_of_year, time, sensor_id)
) WITH CLUSTERING ORDER BY (year DESC, day_of_year DESC, time DESC,       sensor_id DESC);

In the aforementioned document, they make use of this "limit partition row by date" concept. However, it is unclear to me whether or not the date in their examples is part of the partition key.

Upvotes: 2

Answers (2)

Citrullin

Reputation: 2321

In my opinion the datastax model isn't really great. The problem with this model:

They are using the weather station as partition key. All rows with the same partition key are stored on the same machine. This means: If you have 10 years raw data (100ms steps), you will break cassandras limit really fast. 10 years × 365 days × 24 hours × 60 min × 60 seconds x 10 (for 100ms steps) x 7 columns. The limit is 2 Billion. In my opinion you will not use the benefits of cassandra if you build this data model. You can also use, for each weather station, a mongo, mysql or another database.

The better solution: Ask yourself how you will query this data. If you say: I query all data per year, use the year also as partion key. If you need also query data of more than one year, you can create two queries with a different year. This works and the performance is better. (The bottleneck is maybe only the network to your client)

One little more tipp: Cassandra isn't like mysql. It's a denormalized database. This means: It's not dirty to save your data more than one time. This means: It is important for your to query your data per year, it's also important to query your data per hour, per day of year or per sensor_id, you can create column families with different partition key and parimary key order. It's okay to duplicate your data. Cassandra is optimized for write performance, not for read. This means: It's often better to write the data in the right order instead of reading it in the right order. In cassandra 3.0 there is a new feature, called materialized views, for automatic duplicating. And if you think: Ohhh no, i will duplicate the needed storage. Remember: Storage is really cheap. It's okay to buy ten HDDs with 1tb. It cost nothing. The performance is important.

I have one question to your: Can you aggregate your data? Cassandra has a column type called counter. You can create a java/scala application where your aggregate your data while they are produced. You can use a streaming framework for this: Flink or Spark. (If you need a bit more than only counting.). One scenario: You aggregating your data for each hour and day. You got your data in your streaming app. Now: You have an variable for hourly data. You count up or down or whatever. If the hour is finishes, your put this row in your hourly column family and daily column family. In your daily column family your using a counter. I hope, you understand what i mean.

Upvotes: 0

chaitan64arun

Reputation: 783

According to the tutorial, if we choose to have weather_station_id as the only partition, the row will be exhausted. i.e C* has a practical limitation of 2 billion columns per partition.

So IMO, your data-model is bad.

However, it is unclear to me whether or not the date in their examples is part of the partition key.

The tutorial used

PRIMARY KEY ((weatherstation_id,date),event_time)

So, yes they considered data to be part of partition key.

we want all readings for a station to be on the same node.

I am not sure, why you wan't such a requirement. You can always fetch weather data using multiple queries for more than one year.

select * from weather_station_data where weather_station_id=1234 and year= 2013; select * from weather_station_data where weather_station_id=1234 and year= 2014;

So consider changing your structure to

PRIMARY KEY ((weather_station_id, year), day_of_year, time, sensor_id)

Hope it helps!

Upvotes: 1

Cassandra Time Series Data Modelling and Limiting Partition Size

Answers (2)

Related Questions