Cassandra: time series with dynamic columns/sources

Question

There are a lot of questions about storing time series with cassandra, but no one fits to our question, because they all assume a fixed data source and known column names.

About our problem: We're developing a stream data engine, which can connect to different data sources, where the engine receives the data as continuous streams. So, we have for example two data sources called energy and weather. Each incoming stream (or data source) has its own unique key and usually its own schema, e.g.:

source ID 1 with schema energy may has this stream:

timestamp | volts | amps | watts | state
1467795743173 | 210.4 | 2.3 | 290  | "up"
1467795744173 | 212.1 | 2.1 | 287  | "up"
1467795745173 | 213.1 | 2.2 | 242  | "up"
...

source ID 2 with schema weather may has this stream:

ts | condition | temp
1467795740632 | "cloudy" | 33.1
1467795741381 | "cloudy" | 33.4
...

Now we want to give the possibility to store the streams into cassandra, so they can be used later to "replay" a recorded stream, to fetch historic results (e.g. for analytics) and to enrich/join incoming streams with a specific stored data value (e.g. to show/compare the current energy value with the one that was recorded a week ago). In sum we need essentially these things:

read sequentially for a certian source ID
grab ranges for a certain source ID (e.g. fetch all from device ID 2 from 2016-07-06 between 10:00 and 11:00)
save arbitrary columns (in addition to a fixed timestamp-column), because we don't know the schema in advance.

Since we're new to cassandra, we're currently have no idea what is the best way to model the tables and columns.

Most answers to similar questions don't face the possibility to have unknown schemas (they all assume that there is a timestamp, deviceId and a double value) and only facing the problem with the primary/partion keys.

We read about two options:

Should we have one single table, e.g. datapoints, which contains all data with the source-ID+day as partition keys? But how do we handle the dynamic columns here? Do we have to serialize the entire tuple into a single common column, e.g. putting the energydata as well as the weather data all in one column called value.

So we have this table:

CREATE TABLE datapoints (
 sourceid bigint,
 date text, 
 time timestamp, 
 value text,
  PRIMARY KEY ((sourceid, date), time)
 )

Obviously, we cannot use aggreate or other functions on the raw values (e.g. watts, amps or temp).

The other possibility would be to create a table per data source, e.g. using the day as a partition key:

CREATE TABLE energy_1 (   
 date text, 
 time timestamp, 
 volts double,
 amps double,
 watts double,
 state text,
  PRIMARY KEY (date, time)
 )

CREATE TABLE weather_2 (
 date text, 
 time timestamp, 
 condition text,
 temp double,
  PRIMARY KEY (date, time)
 )

Since the data would be partioned using the date, is it possible to fetch e.g. a week or would this not be possible? Although there could be more than one data source with the same schema (e.g. two energy data sources), we don't know it and this is very seldom. So using device-id as a partioning key would not make sense, because there is mostly only one device-key per schema.

But the second solution also looks not very suitable.

We hope that someone also solved a similar problem and has some advice?!

Remark: we don't want to use other time series db :)

Brad Schoening · Accepted Answer

Consider using a map for your data values:

CREATE TABLE datapoints (
  sourceid bigint,
  date text, 
  time timestamp, 
  values map,
  PRIMARY KEY ((sourceid, date), time)
 )

You could also use maps for distinct data types:

CREATE TABLE datapoints (
  sourceid bigint,
  date text, 
  time timestamp, 
  strvalues map,
  intvalues map,
  decvalues map,
  PRIMARY KEY ((sourceid, date), time)
 )

Cassandra: time series with dynamic columns/sources

Answers (2)

Related Questions