Mohan

Reputation: 11

Time-series data modelling in Cassandra

I am trying to store and retrieve data in Cassandra in the following way:

Storing Data: I created the table in the following way:

CREATE TABLE mydata (
    myKey TEXT,
    datetime TIMESTAMP,
    value TEXT,
    PRIMARY KEY (myKey,datetime)
);

Here I would store a value for every minute for the last 5 years, so each row stores 1440 * 365 * 5 = 2,628,000 records/columns (with myKey as the row key).

INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:01:00','72F');
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:02:00','72F');
INSERT INTO mydata(myKey, datetime, value) VALUES ('1234ABCD','2013-04-03 07:03:00','72F');

.................

I am able to store data and all is fine. However, I would like to know if this is an efficient way of storing data horizontally (2,628,000 values for each key, for 1 million such keys altogether)?

Retrieving Data:

After storing the data in the above format, I am able to select data using a simple SELECT query for a period. Ex:

SELECT * 
FROM mydata 
WHERE myKey='1234ABCD' AND datetime > '2013-04-03 07:01:00' AND datetime < '2013-04-03 07:04:00';

The query works fine and I get the results as expected.

However, my questions are:

  1. How can I select only the values at certain intervals? For example, if I query data for a day, I would get 1440 values (one for every minute). I would like to get values at a 10-minute interval (the value at every 10th minute), limiting the number of values to 144.
  2. Is there a way to query the table like that if we use the above storage strategy?
  3. If not, what are the possible options to meet my requirement of querying data at a specific interval like 1 minute, 10 minutes, 1 hour, 1 day, etc.?

Appreciate any other suggestions.

Upvotes: 0

Views: 185

Answers (2)

Swan Jml

Reputation: 36

You definitely need to bound your partitions with a bucketed version of the timestamp. But the right granularity really depends on your reads.

If you are mainly going to read per day, then use something like this: PK((myKey, yyyymmdd), time)

If mainly by week, PK((myKey, yyyyww), time), or by month...

The problem is then reading values for a whole year: in that case you are better off with a partition per week or per month (or even per year would do, I think, if you don't do any deletes). Your partition size needs to stay smaller than about 100 MB.
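For illustration, a rough sketch of the per-day bucketing described above. The table name mydata_by_day and the yyyymmdd column are assumptions for this example, not from the original question:

-- Hypothetical day-bucketed version of the original table.
-- The partition key combines myKey with a yyyymmdd bucket,
-- so each partition holds at most 1440 rows (one per minute).
CREATE TABLE mydata_by_day (
    myKey TEXT,
    yyyymmdd INT,
    datetime TIMESTAMP,
    value TEXT,
    PRIMARY KEY ((myKey, yyyymmdd), datetime)
);

INSERT INTO mydata_by_day (myKey, yyyymmdd, datetime, value)
VALUES ('1234ABCD', 20130403, '2013-04-03 07:01:00', '72F');

-- Reading a whole day is then a single-partition query:
SELECT *
FROM mydata_by_day
WHERE myKey = '1234ABCD' AND yyyymmdd = 20130403;

With one-minute data and TEXT values like '72F', a per-day partition stays far below the ~100 MB guideline; reading a year then means querying 365 partitions (e.g. with an IN clause or parallel queries from the client).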

Upvotes: 0

mahendra singh

Reputation: 384

No, it is not right; in the future you will face problems, because per row key we can only store 2 billion records (columns). After that it will not give an error, but it will not store the data either. For your problem, split the timestamp column into year, month, day and time, e.g. 2016, 04, 04 and 15:03:00, and put year, month and day into the partition key as well.
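A minimal sketch of that layout, assuming illustrative names (mydata_split and the year/month/day columns are not from the answer):

-- Hypothetical table with the timestamp split out,
-- so each (key, year, month, day) combination is its own partition.
CREATE TABLE mydata_split (
    myKey TEXT,
    year INT,
    month INT,
    day INT,
    datetime TIMESTAMP,
    value TEXT,
    PRIMARY KEY ((myKey, year, month, day), datetime)
);

INSERT INTO mydata_split (myKey, year, month, day, datetime, value)
VALUES ('1234ABCD', 2016, 4, 4, '2016-04-04 15:03:00', '72F');

-- Each partition now covers one key for one day (at most 1440 rows):
SELECT *
FROM mydata_split
WHERE myKey = '1234ABCD' AND year = 2016 AND month = 4 AND day = 4;

This keeps every partition far below the 2-billion-cell limit, at the cost of having to query multiple partitions for ranges that span several days.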

Upvotes: 2
