Jason
Jason

Reputation: 2076

What columns to PARTITION BY in a time-series table?

I want to collect time-series data and store it in snappydata store. I will be collecting millions of rows of data and I want to make queries across timeslices/ranges.

Here is an example query I want to do:

select avg(value) 
from example_timeseries_table 
where time >= :startDate and time < :endDate;

So, I am thinking that I want to have PARTITION BY COLUMN on time columns rather than the classic PRIMARY KEY column. In other technologies that I am familiar with like Cassandra DB, using the time columns in the partition key would point me directly at the partition and allow pulling the data for the timeslice in a single node rather than across many distributed nodes.

To be performant, I assume I need to partition by column 'time', in this table.

example_timeseries_table
------------------------
id int not nullable, 
value varchar(128) not nullable,  
time   timestamp not nullable 
PERSISTENT ASYNCHRONOUS 
PARTITION BY COLUMN time

Is this the correct column to partition on for efficient, time-slice queries or do I need to make even more columns like: year_num, month_num, day_num, hour_num columns and PARTITION BY COLUMN on all of them as well, then do a query like this to focus the query to a particular partitioned node?:

select avg(value) 
from example_table 
where year_num = 2016 
and month_num= 1 
and day_num = 4 
and hour_num = 11 
and time >= :startDate and time < :endDate;

Upvotes: 1

Views: 130

Answers (1)

Hemant Bhanawat
Hemant Bhanawat

Reputation: 165

When a single partition has all the data, a single processor processes that data and you lose distributed processing. In fact, if you have time series data, most of the time you would be querying the node that holds the latest time range and the rest of your compute capacity sits idle. If you expect concurrent queries on various time ranges then it may be fine but that is not the case most of the time.

Assuming that you are working with row tables, another way to speed up your queries would be by creating an index on your time column.

SnappyData supports partition pruning on row tables. In case you decide to go the way you mention here, the timestamp column's partition pruning should work.

Upvotes: 1

Related Questions