What columns to PARTITION BY in a time-series table?

Question

I want to collect time-series data and store it in snappydata store. I will be collecting millions of rows of data and I want to make queries across timeslices/ranges.

Here is an example query I want to do:

select avg(value) 
from example_timeseries_table 
where time >= :startDate and time < :endDate;

So, I am thinking that I want to have PARTITION BY COLUMN on time columns rather than the classic PRIMARY KEY column. In other technologies that I am familiar with like Cassandra DB, using the time columns in the partition key would point me directly at the partition and allow pulling the data for the timeslice in a single node rather than across many distributed nodes.

To be performant, I assume I need to partition by column 'time', in this table.

example_timeseries_table
------------------------
id int not nullable, 
value varchar(128) not nullable,  
time   timestamp not nullable 
PERSISTENT ASYNCHRONOUS 
PARTITION BY COLUMN time

Is this the correct column to partition on for efficient, time-slice queries or do I need to make even more columns like: year_num, month_num, day_num, hour_num columns and PARTITION BY COLUMN on all of them as well, then do a query like this to focus the query to a particular partitioned node?:

select avg(value) 
from example_table 
where year_num = 2016 
and month_num= 1 
and day_num = 4 
and hour_num = 11 
and time >= :startDate and time < :endDate;

What columns to PARTITION BY in a time-series table?

Answers (1)

Related Questions