Reputation: 2418
I am fairly new to Cassandra and I am trying to understand how to design my tables for IoT sensors.
The idea is to have several devices, each with several sensors attached, sending data periodically (up to around 200,000 values per sensor per device per day).
I'd like to be able to query for the latest value of a sensor for a specific list of sensors and devices in more or less real time. Also, devices do not always send data and may be down for long periods of time.
After a lot of reading, I came up with something like this:
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
The idea behind this would be to perform one query per device and sensor, such as:
SELECT deviceid, sensorid, ts, value FROM sensor_data WHERE deviceid = 'device1' AND sensorid = 'temperature' LIMIT 1;
And run this for each device and sensor. It's not a single query that returns everything (which would be ideal), but it seems fast enough to run for potentially up to 100 sensors or so across a few devices (with the possibility of parallelizing the queries, as sketched below).
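For illustration, that fan-out could be sketched with the DataStax Python driver like this (the contact point, keyspace name, and device/sensor list below are made-up placeholders, not part of the question):
# Sketch only: assumes the DataStax Python driver (pip install cassandra-driver),
# a single local node, and a keyspace named "iot"; all of these are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('iot')

latest = session.prepare(
    "SELECT deviceid, sensorid, ts, value FROM sensor_data "
    "WHERE deviceid = ? AND sensorid = ? LIMIT 1")

# Example device/sensor pairs; in practice this list comes from the application.
pairs = [('device1', 'temperature'), ('device1', 'humidity'),
         ('device2', 'temperature')]

# Fire all queries asynchronously, then collect the results.
futures = [(d, s, session.execute_async(latest, (d, s))) for d, s in pairs]
for deviceid, sensorid, future in futures:
    row = future.result().one()
    if row is not None:
        print(deviceid, sensorid, row.ts, row.value)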
However, from what I have read so far, I understand this would give me very wide partitions (a huge number of rows per partition), which might be complicated in terms of long-term storage and Cassandra limitations.
I am thinking that maybe adding something like the date to the partition key (as seen in some blogs and guides) might be a good idea:
CREATE TABLE "sensor_data" (
deviceid TEXT,
sensorid TEXT,
date TEXT,
ts timestamp,
value TEXT,
PRIMARY KEY ((deviceid, sensorid, date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
And then query like:
SELECT deviceid, sensorid, date, ts, value FROM sensor_data WHERE deviceid = 'device1' AND sensorid = 'temperature' AND date = '2018-11-14' LIMIT 1;
Does that even make sense? It feels like it might mitigate storage issues and allow for easier archiving of old data in the future. However, how do I go about querying for the latest value of a specific sensor and device if that device was down for a day or more? Do I really have to query one day and, if nothing is found, query the previous day and so forth (maybe limited to only the last few days or so), as in the sketch below?
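To illustrate what I mean, the day-by-day fallback could look roughly like this (sketched with the DataStax Python driver; the contact point, keyspace, and 7-day cap are made-up details):
# Sketch: walk backwards one day at a time until a value is found, capped at a few days.
from datetime import date, timedelta
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('iot')

latest_on_day = session.prepare(
    "SELECT ts, value FROM sensor_data "
    "WHERE deviceid = ? AND sensorid = ? AND date = ? LIMIT 1")

def latest_value(deviceid, sensorid, max_days_back=7):
    day = date.today()
    for _ in range(max_days_back):
        row = session.execute(latest_on_day,
                              (deviceid, sensorid, day.isoformat())).one()
        if row is not None:
            return row
        day -= timedelta(days=1)   # device may have been down; try the previous day
    return None

print(latest_value('device1', 'temperature'))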
Are there better ways to handle this in Cassandra, or am I heading in the right direction?
Upvotes: 0
Views: 858
Reputation: 66
Part of the problem you'll run into is that each sensor will have around 200k readings per day. In general, you want to keep each partition under roughly 100k rows, so your second idea (having the date as part of the partition key) may still have performance issues.
Really, what you are looking to do is what we refer to as 'bucketing': grouping things together so that queries are usable and performant.
To really help with this, we needed to understand a little more about your devices, sensors, and query patterns. Incorporating your answers into the answer below:
Alright, here is a potential idea...
We DO care about bucketing, though, to try to stay around the optimal ~100k rows per partition.
You're going to want two tables:
The lookup table will look something like:
CREATE TABLE lookup_table (
deviceid TEXT,
sensor_map MAP<TEXT, TEXT>,
PRIMARY KEY (deviceid)
);
- deviceid is the unique ID for each device.
- sensor_map is a map of the sensors that a given device has and a corresponding unique ID for each specific sensor (e.g. {'temperature': '183439', 'humidity': '84543292', 'other-sensor': 'blah'}).
- To find a device's sensors: SELECT * FROM lookup_table WHERE deviceid = '1234';
The sensor data table will look like:
CREATE TABLE sensor_data (
sensorid TEXT,
sensor_value INT, -- or whatever data type fits what you need
ts TIMESTAMP,
reading_date DATE,
time_bucket INT,
PRIMARY KEY ((reading_date, sensorid, time_bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
- reading_date buckets each day into its own partition, and each day is split into two (due to the amount of readings that you're expecting): AM or PM, where AM equals bucket 1 and PM equals bucket 2. Or use 24-hour time, where 0000-1159 equals 1 and 1200-2359 equals 2.
- time_bucket will come from the time at which you're actually requesting the query (e.g. if the time is 1135 hours, then time_bucket = 1) and reading_date will come from the actual day that you are querying; the sensorid values come from the lookup table. (A small sketch of computing these client-side follows this list.)
- Since the table is clustered by ts DESC, LIMIT 1 will retrieve the latest reading for that given sensorid. So it would look like:
SELECT * FROM sensor_data WHERE reading_date = '2017-12-31' AND sensorid = '1234' AND time_bucket = 1 LIMIT 1;
- By keeping ts as a clustering column, you'll be able to keep all of the readings for a given sensor; none will be overwritten.
- Important to know: this works great if there is an even distribution of sensor readings throughout the 24-hour day. However, if you're reading heavily in the morning and not at all in the afternoon, the distribution isn't even and we'll have to figure out another way to bucket. But I think you get what is going on.
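As a rough sketch of computing reading_date and time_bucket client-side (using the DataStax Python driver; the contact point, keyspace, and example sensorid are hypothetical, and this assumes the AM/PM bucketing described above):
# Sketch: derive reading_date and time_bucket from "now" and fetch the latest reading.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('iot')

latest = session.prepare(
    "SELECT sensorid, ts, sensor_value FROM sensor_data "
    "WHERE reading_date = ? AND sensorid = ? AND time_bucket = ? LIMIT 1")

def latest_reading(sensorid):
    now = datetime.now(timezone.utc)
    time_bucket = 1 if now.hour < 12 else 2   # AM = 1, PM = 2
    # Note: if the current bucket is empty (device was down), you would fall back
    # to the previous bucket or day in the same way.
    return session.execute(latest, (now.date(), sensorid, time_bucket)).one()

print(latest_reading('183439'))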
To query:
- First query the lookup table to get the sensorid values that a device has; once you have those sensorid values, you can then use them for the next step.
- Then query the latest sensor_value for each sensorid. (See the sketch after this list.)
- Since the readings are bucketed (by reading_date and time_bucket), you should have an even distribution throughout all of the partitions.
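A sketch of that first step (same hypothetical driver, contact point, and keyspace as above) could be:
# Sketch: fetch the device's sensor map, then fan out one latest-value query per sensorid.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('iot')

row = session.execute(
    "SELECT sensor_map FROM lookup_table WHERE deviceid = %s", ('1234',)).one()

sensor_ids = list(row.sensor_map.values()) if row else []
print(sensor_ids)   # e.g. ['183439', '84543292']; feed these into the latest-reading query above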
Lastly: "give me the latest sensorid by a given value."
To do that, there are a couple of different ways...
For the additional C* table:
CREATE TABLE sensor_by_value (
sensor_value INT,
ts TIMESTAMP,
sensorid TEXT,
reading_date DATE,
time_bucket INT,
PRIMARY KEY ((sensor_value, reading_date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
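On the write side, each incoming reading would then go to both tables. A minimal sketch (same hypothetical connection details as above; column types follow the tables above):
# Sketch: write one reading to sensor_data and to sensor_by_value.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('iot')

ins_data = session.prepare(
    "INSERT INTO sensor_data (reading_date, sensorid, time_bucket, ts, sensor_value) "
    "VALUES (?, ?, ?, ?, ?)")
ins_by_value = session.prepare(
    "INSERT INTO sensor_by_value (sensor_value, reading_date, ts, sensorid, time_bucket) "
    "VALUES (?, ?, ?, ?, ?)")

def store_reading(sensorid, value):
    ts = datetime.now(timezone.utc)
    bucket = 1 if ts.hour < 12 else 2   # AM = 1, PM = 2
    session.execute(ins_data, (ts.date(), sensorid, bucket, ts, value))
    session.execute(ins_by_value, (value, ts.date(), ts, sensorid, bucket))

store_reading('183439', 21)   # e.g. a temperature reading of 21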
You will definitely have to do some time bucketing here as well, since a frequently occurring value could still produce a very wide partition.
Final tip: look into compaction strategies, specifically the time window compaction strategy (TWCS), and adding a default_time_to_live (a sketch of setting both follows this list):
- Your data seems immutable after the initial insert.
- TWCS will make the operational overhead of compaction much lower as you fine-tune it for the time window that you need.
- A default_time_to_live will also help with the operational overhead of deleting data after you don't need it anymore.
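As a sketch of what that could look like (the 1-day compaction window and 90-day TTL below are illustrative values, not recommendations; tune them to your retention needs):
# Sketch: enable TWCS and a default TTL (90 days = 7776000 seconds) on the sensor table.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('iot')   # hypothetical contact point / keyspace
session.execute(
    "ALTER TABLE sensor_data "
    "WITH compaction = {'class': 'TimeWindowCompactionStrategy', "
    "'compaction_window_unit': 'DAYS', 'compaction_window_size': 1} "
    "AND default_time_to_live = 7776000")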
Does this answer and/or satisfy the queries that you're trying to answer? If not, let me know and we can iterate.
To learn all of this stuff, go to DataStax Academy for a ton of free training. Data Modeling (DS 220) is a great course!
Upvotes: 5