Reputation: 351
I have time series data on a PostgreSQL database. Currently I import this into pandas to do some analysis and my first step often times is resampling the 5 min frequency data to 1 hour average data. I do this so that I first pivot the data to bring it in wide form, then I resample it to 1 hour and after that I melt it to have it in long form again.
Now I want to do the resampling on the Database so that I can import the 1 hour average right away.
This is what the data looks like on the Database. I have two different types with 3 different names.
datetime value type name
0 2018-01-01 13:35:00+01:00 0.22 HLN NO2
1 2018-01-01 13:35:00+01:00 0.31 HLN CO
2 2018-01-01 13:35:00+01:00 1.15 HLN NO
3 2018-01-01 13:40:00+01:00 1.80 AIS NO2
4 2018-01-01 13:40:00+01:00 2.60 AIS CO
5 2018-01-01 13:40:00+01:00 2.30 AIS NO
6 2018-01-01 13:45:00+01:00 2.25 HLN NO2
7 2018-01-01 13:45:00+01:00 2.14 HLN CO
8 2018-01-01 13:45:00+01:00 2.96 HLN NO
9 2018-01-01 14:35:00+01:00 0.76 HLN NO2
10 2018-01-01 14:35:00+01:00 0.80 HLN CO
11 2018-01-01 14:35:00+01:00 1.19 HLN NO
12 2018-01-01 14:40:00+01:00 1.10 AIS NO2
13 2018-01-01 14:40:00+01:00 2.87 AIS CO
14 2018-01-01 14:40:00+01:00 2.80 AIS NO
15 2018-01-01 14:45:00+01:00 3.06 HLN NO2
16 2018-01-01 13:45:00+01:00 2.86 HLN CO
17 2018-01-01 13:45:00+01:00 2.22 HLN NO
Now comes the part which I have problems with. After resampling and plotting it in pandas and plotly I get the expected result which is right, one value for every hour:
After performing a SQL Query to resample it to one hour with the following code:
SELECT date_trunc('hour', datetime) AS hour, type, AVG(value) AS measure, name
FROM data_table
GROUP BY datetime, type, name
ORDER BY datetime
I get this after plotting:
This is not smooth and there are multiple values in the hour, I guess there are all the values within the hour.
My question, how can I correctly resample a time series in SQL?
Edit: Expected result in table form:
datetime value type name
2018-01-01 13:00:00 1.235 HLN NO2
2018-01-01 13:00:00 2.65 HLN CO
2018-01-01 13:00:00 2.96 HLN NO
2018-01-01 13:00:00 2.48 AIS NO2
2018-01-01 13:00:00 2.65 AIS CO
2018-01-01 13:00:00 2.26 AIS NO
2018-01-01 14:00:00 2.78 HLN NO2
2018-01-01 14:00:00 3.65 HLN CO
2018-01-01 14:00:00 1.95 HLN NO
2018-01-01 14:00:00 1.45 AIS NO2
2018-01-01 14:00:00 1.64 AIS CO
2018-01-01 14:00:00 3.23 AIS NO
Upvotes: 2
Views: 2646
Reputation: 19643
An alternative is to create the time intervals using generate_series()
or just with in a subquery / CTE truncate the hours per type
and name
, and in the outer query join both records and aggregate the values
with avg()
by the columns hour
, type
and name
, e.g.
WITH j AS (
SELECT DISTINCT date_trunc('hour', datetime) AS hour, type,name
FROM data_table
)
SELECT j.*, avg(d.value)
FROM data_table d
JOIN j ON date_trunc('hour', d.datetime) = j.hour AND
j.type = d.type AND
d.name = j.name
GROUP BY j.hour, j.name, j.type
ORDER BY j.hour ASC,j.type DESC;
returns
hour | type | name | avg |
---|---|---|---|
2018-01-01 13:00:00 | HLN | CO | 1.7700000000000000 |
2018-01-01 13:00:00 | HLN | NO | 2.1100000000000000 |
2018-01-01 13:00:00 | HLN | NO2 | 1.23500000000000000000 |
2018-01-01 13:00:00 | AIS | CO | 2.6000000000000000 |
2018-01-01 13:00:00 | AIS | NO | 2.3000000000000000 |
2018-01-01 13:00:00 | AIS | NO2 | 1.80000000000000000000 |
2018-01-01 14:00:00 | HLN | CO | 0.80000000000000000000 |
2018-01-01 14:00:00 | HLN | NO | 1.19000000000000000000 |
2018-01-01 14:00:00 | HLN | NO2 | 1.9100000000000000 |
2018-01-01 14:00:00 | AIS | CO | 2.8700000000000000 |
2018-01-01 14:00:00 | AIS | NO | 2.8000000000000000 |
2018-01-01 14:00:00 | AIS | NO2 | 1.10000000000000000000 |
Upvotes: 1