Reputation: 4675
I have a time series table where measurements are recorded into "wide" rows. Rows may contain all measurements or only some. The other columns are then set to NULL
.
I would like to use timebucket_gapfill()
to "clean" this table and make sure that every row in the output has data in all columns, even if the underlying dataset has some null values for some of the columns.
This is how I prepare the table with some data (schema from the getting started guide):
CREATE TABLE conditions (
time TIMESTAMPTZ NOT NULL,
location TEXT NOT NULL,
temperature DOUBLE PRECISION NULL,
humidity DOUBLE PRECISION NULL
);
SELECT create_hypertable('conditions', 'time');
INSERT INTO conditions(time, location, temperature, humidity)
VALUES ('2019-07-10 05:02:14-07', 'office', 70.0, 50.0);
INSERT INTO conditions(time, location, temperature, humidity)
VALUES ('2019-07-10 05:02:15-07', 'office', 71.0, null);
INSERT INTO conditions(time, location, temperature, humidity)
VALUES ('2019-07-10 05:02:16-07', 'office', 72.0, 48.0);
-- gap at 2019-07-10 05:02:17-07
INSERT INTO conditions(time, location, temperature, humidity)
VALUES ('2019-07-10 05:02:18-07', 'office', 72.0, 48.0);
INSERT INTO conditions(time, location, temperature, humidity)
VALUES ('2019-07-10 05:02:18.8-07', 'office', 72.1, NULL);
INSERT INTO conditions(time, location, temperature, humidity)
VALUES ('2019-07-10 05:02:19.2-07', 'office', NULL, 46.0);
INSERT INTO conditions(time, location, temperature, humidity)
VALUES ('2019-07-10 05:02:20-07', 'office', 73.0, 45.0);
And this is how I query it:
SELECT
time_bucket_gapfill('1000ms', time,
start => '2019-07-10 05:02:13',
finish => '2019-07-10 05:02:21'
) as ival,
count(*) as samplesUsed,
interpolate(avg(temperature)) as lineartemperature,
interpolate(avg(humidity)) as linearhumidity
FROM conditions
GROUP BY ival
ORDER BY ival;
The output is:
ival | samplesused | lineartemperature | linearhumidity
------------------------+-------------+-------------------+----------------
2019-07-10 05:02:13-07 | | |
2019-07-10 05:02:14-07 | 1 | 70 | 50
2019-07-10 05:02:15-07 | 1 | 71 |
2019-07-10 05:02:16-07 | 1 | 72 | 48
2019-07-10 05:02:17-07 | | 72.025 | 48
2019-07-10 05:02:18-07 | 2 | 72.05 | 48
2019-07-10 05:02:19-07 | 1 | | 46
2019-07-10 05:02:20-07 | 1 | 73 | 45
How do I write the query to return an interpolated value for all measurement columns?
Upvotes: 1
Views: 1591
Reputation: 4675
Timescaledb does not consider NULL as missing values. I have to rewrite the query to avoid the rows with NULL values, that means doing multiple queries with timebucket_gapfill
and joining the results together.
This works and does what I wanted:
SELECT
condh.ival, humidity, temperature
from
(
select
time_bucket_gapfill('1000ms', time,
start => '2019-07-10 05:02:13',
finish => '2019-07-10 05:02:21'
) as ival,
count(*) as samplesUsed,
interpolate(avg(humidity)) as humidity
FROM conditions
WHERE humidity is not NULL
GROUP BY ival
) condh
INNER JOIN
(
SELECT
time_bucket_gapfill('1000ms', time,
start => '2019-07-10 05:02:13',
finish => '2019-07-10 05:02:21'
) as ival,
count(*) as samplesUsed,
interpolate(avg(temperature)) as temperature
FROM conditions
WHERE temperature is not NULL
GROUP BY ival
) condt
on (condt.ival = condh.ival)
ORDER BY ival;
Output:
ival | humidity | temperature
------------------------+----------+-------------
2019-07-10 05:02:13-07 | |
2019-07-10 05:02:14-07 | 50 | 70
2019-07-10 05:02:15-07 | 49 | 71
2019-07-10 05:02:16-07 | 48 | 72
2019-07-10 05:02:17-07 | 48 | 72.025
2019-07-10 05:02:18-07 | 48 | 72.05
2019-07-10 05:02:19-07 | 46 | 72.525
2019-07-10 05:02:20-07 | 45 | 73
(8 rows)
Got some help on the timescaledb slack - thanks gayathri.
Upvotes: 2