Reputation: 131
I have a hypertable in timescaledb called prices
that looks like the following:
ts | instrumentId | value |
---|---|---|
08:00:01 | A | 100 |
08:00:01 | B | 200 |
08:00:02 | B | 205 |
08:00:04 | A | 95 |
08:00:06 | C | 300 |
08:00:07 | A | 90 |
Where each row is a measurement (value) at a given timestamp (ts) for a given instrument (instrumentId). Each instrument is coming in at different times, and I have no guarantee that I'll have a datapoint for a given instrument at a given timestamp.
My hypertable is using the timestamp column as my time index, and for additional context, I have ~200M rows spanning several years of sub-second data for dozens of instruments.
I want to be able to generate 5-second unweighted historical means for each instrument for each measurement that I have. For each row in my table, I want to find all of the rows which have the same instrument and a timestamp in the interval [row_timestamp - 5 seconds, row_timestamp] (this set includes the row that we're generating the value for). I'd then like to take the average of that set, but it's reasonable that I'd want to take other statistics from it as well (stddev, sum, etc.). For example, for the row (08:00:07, A), the window covers A's rows at 08:00:04 and 08:00:07, so the mean is (95 + 90) / 2 = 92.5.
The output of this query for the above example table would look like the below:
ts | instrumentId | avg_5s_window_value |
---|---|---|
08:00:01 | A | 100 |
08:00:01 | B | 200 |
08:00:02 | B | 202.5 |
08:00:04 | A | 97.5 |
08:00:06 | C | 300 |
08:00:07 | A | 92.5 |
I can achieve this in small batches of data by joining the table against itself; however, this is a very inefficient solution, and I suspect the correct way to achieve this would have some kind of deque under the hood. The query for this looks like the following:
WITH lhs AS (
    SELECT
        ts, instrumentId, value,
        -- row_number() instead of rank(): rank() assigns equal ranks when
        -- two instruments share a timestamp, which would merge their
        -- groups in the final GROUP BY.
        row_number() OVER (ORDER BY instrumentId, ts) AS rn
    FROM prices
),
splay AS (
    SELECT
        lhs.instrumentId, lhs.ts, lhs.rn, rhs.value
    FROM lhs
    JOIN lhs AS rhs
      ON lhs.instrumentId = rhs.instrumentId
     AND rhs.ts BETWEEN lhs.ts - INTERVAL '5 seconds' AND lhs.ts
)
SELECT
    MAX(instrumentId) AS instrumentId,
    MIN(ts) AS ts,
    AVG(value) AS avg_5s_window_value
FROM splay
GROUP BY rn;
The above takes ~2 minutes to run for a single day of data, but I can do the operation in pandas on 1 year of data in ~45 seconds, so I'm confident there's a better way to do it in SQL.
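For reference, a pandas sketch of the same trailing-window computation (the dates are assumed; the question only gives times of day, and the asker's actual pandas code may differ):

```python
import pandas as pd

# Sample data matching the example table (date is assumed for illustration).
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 08:00:01", "2024-01-01 08:00:01",
        "2024-01-01 08:00:02", "2024-01-01 08:00:04",
        "2024-01-01 08:00:06", "2024-01-01 08:00:07",
    ]),
    "instrumentId": ["A", "B", "B", "A", "C", "A"],
    "value": [100.0, 200.0, 205.0, 95.0, 300.0, 90.0],
})

# Per-instrument trailing window [ts - 5s, ts]; closed="both" makes the
# window inclusive at both ends, matching SQL's BETWEEN.
parts = []
for _, grp in df.groupby("instrumentId"):
    grp = grp.sort_values("ts").copy()
    grp["avg_5s_window_value"] = (
        grp.rolling("5s", on="ts", closed="both")["value"].mean()
    )
    parts.append(grp)

result = pd.concat(parts).sort_values(["ts", "instrumentId"])
```

This reproduces the expected output table above; pandas keeps a per-group deque-like window internally, which is exactly the behavior the self-join lacks.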
How can I achieve efficient grouped window functions in timescaledb/postgres?
Upvotes: 3
Views: 1137
Reputation: 1053
I think what you're looking for is time_bucket
with a group by, so something like:
SELECT time_bucket('5s'::interval, ts), instrumentID, avg(value)
FROM prices
GROUP BY time_bucket('5s'::interval, ts), instrumentID;
That'll group by the timestamp at the beginning of each bucket; if you want the end of the bucket instead, you can just add '5s' to it...
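To see the bucketing semantics outside SQL, here's a rough pandas analogue (a sketch, not the answer's SQL; for 5-second buckets, flooring the timestamp matches time_bucket's default alignment):

```python
import pandas as pd

# Sample data matching the question's example table (date assumed).
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 08:00:01", "2024-01-01 08:00:01",
        "2024-01-01 08:00:02", "2024-01-01 08:00:04",
        "2024-01-01 08:00:06", "2024-01-01 08:00:07",
    ]),
    "instrumentId": ["A", "B", "B", "A", "C", "A"],
    "value": [100.0, 200.0, 205.0, 95.0, 300.0, 90.0],
})

# time_bucket('5s', ts) floors each timestamp to the start of its
# 5-second bucket; the GROUP BY then averages within each bucket.
df["bucket"] = df["ts"].dt.floor("5s")
bucketed = df.groupby(["bucket", "instrumentId"], as_index=False)["value"].mean()
```

Note this produces one row per (bucket, instrument), not one row per original measurement — which is why the window-function form below is what the question actually asks for.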
If you're trying to do this for each row in your original dataset, then you'd use a window function with a PARTITION BY clause and a RANGE frame (RANGE with an offset requires PostgreSQL 11+), like so:
SELECT *,
       avg(value) OVER (
           PARTITION BY instrumentId
           ORDER BY ts
           RANGE BETWEEN INTERVAL '5 seconds' PRECEDING AND CURRENT ROW
       ) AS avg_5s_window_value
FROM prices;
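Since the question mentions wanting stddev, sum, etc. too: in the window-function form you just add more aggregates over the same frame (e.g. stddev(value) OVER (...), sum(value) OVER (...)). A rough pandas analogue of reusing one window for several statistics (a sketch with made-up single-instrument data):

```python
import pandas as pd

# Made-up data for one instrument, sorted by timestamp.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 08:00:01", "2024-01-01 08:00:04", "2024-01-01 08:00:07",
    ]),
    "value": [100.0, 95.0, 90.0],
})

# One rolling object yields several window statistics over the same
# trailing [ts - 5s, ts] frame.
roll = df.rolling("5s", on="ts", closed="both")["value"]
df["avg_5s"] = roll.mean()
df["sum_5s"] = roll.sum()
```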
Upvotes: 4