Reputation: 1864
There are a lot of questions about storing time series with cassandra, but no one fits to our question, because they all assume a fixed data source and known column names.
About our problem:
We're developing a stream data engine, which can connect to different data sources, where the engine receives the data as continuous streams.
So, we have for example two data sources called energy
and weather
. Each incoming stream (or data source) has its own unique key and usually its own schema, e.g.:
source ID 1
with schema energy
may has this stream:
timestamp | volts | amps | watts | state
1467795743173 | 210.4 | 2.3 | 290 | "up"
1467795744173 | 212.1 | 2.1 | 287 | "up"
1467795745173 | 213.1 | 2.2 | 242 | "up"
...
source ID 2
with schema weather
may has this stream:
ts | condition | temp
1467795740632 | "cloudy" | 33.1
1467795741381 | "cloudy" | 33.4
...
Now we want to give the possibility to store the streams into cassandra, so they can be used later to "replay" a recorded stream, to fetch historic results (e.g. for analytics) and to enrich/join incoming streams with a specific stored data value (e.g. to show/compare the current energy value with the one that was recorded a week ago). In sum we need essentially these things:
Since we're new to cassandra, we're currently have no idea what is the best way to model the tables and columns.
Most answers to similar questions don't face the possibility to have unknown schemas (they all assume that there is a timestamp, deviceId and a double value) and only facing the problem with the primary/partion keys.
We read about two options:
datapoints
, which contains all data with the source-ID+day as partition keys? But how do we handle the dynamic columns here? Do we have to serialize the entire tuple into a single common column, e.g. putting the energydata as well as the weather data all in one column called value
. So we have this table:
CREATE TABLE datapoints (
sourceid bigint,
date text,
time timestamp,
value text,
PRIMARY KEY ((sourceid, date), time)
)
Obviously, we cannot use aggreate or other functions on the raw values (e.g. watts, amps or temp).
The other possibility would be to create a table per data source, e.g. using the day as a partition key:
CREATE TABLE energy_1 (
date text,
time timestamp,
volts double,
amps double,
watts double,
state text,
PRIMARY KEY (date, time)
)
CREATE TABLE weather_2 (
date text,
time timestamp,
condition text,
temp double,
PRIMARY KEY (date, time)
)
Since the data would be partioned using the date, is it possible to fetch e.g. a week or would this not be possible? Although there could be more than one data source with the same schema (e.g. two energy data sources), we don't know it and this is very seldom. So using device-id as a partioning key would not make sense, because there is mostly only one device-key per schema.
But the second solution also looks not very suitable.
We hope that someone also solved a similar problem and has some advice?!
Remark: we don't want to use other time series db :)
Upvotes: 0
Views: 584
Reputation: 1371
Consider using a map for your data values:
CREATE TABLE datapoints (
sourceid bigint,
date text,
time timestamp,
values map<text, text>,
PRIMARY KEY ((sourceid, date), time)
)
You could also use maps for distinct data types:
CREATE TABLE datapoints (
sourceid bigint,
date text,
time timestamp,
strvalues map<text, text>,
intvalues map<text, int>,
decvalues map<text, decimal>,
PRIMARY KEY ((sourceid, date), time)
)
Upvotes: 1
Reputation: 1490
Maybe you can use a mix of the two solutions :
You can create a table with all informations like :
CREATE TABLE datapoints (
sourceid bigint,
date text,
time timestamp,
value text,
volts double,
amps double,
watts double,
state text,
condition text,
temp double,
PRIMARY KEY ((sourceid, date), time)
);
When you insert data, many columns may not specified. But no matter, cassandra don't store NULL values on disk. So this table with columns conditions
and temp
specified with take as mutch space as table weather_2
in your exemple.
If you use cassandra >= 3.0, yopu can create materialized view to create specific tables (like energy_1
and weather_2
) or to create other way to read data (respecify partition key for exemple).
Hope this can help you.
Upvotes: 0