DogEatDog

Reputation: 3047

SQL - Find records in continuous time series as groups in PostgreSQL

I have time series data in which I'm trying to find continuous streaks of records that fall within a certain time interval of each other, grouping each such streak as a set. For example, if the interval is 5 minutes, any record within 5 minutes of the previous record belongs to the current set (group), and any record more than 5 minutes after the previous one ends the set. A new set (group) begins with the next two records that fall within 5 minutes of each other.

** Minute separation is not stored in the table; it is calculated in the query (see the sketch after the table).

| id | timestamp           | minute separation (Calculated **) | group    | notes                                                      |
|----|---------------------|-----------------------------------|----------|------------------------------------------------------------|
| 1  | 2018-02-13T01:18:00 | 0                                 | Group 1  |                                                            |
| 2  | 2018-02-13T01:22:00 | 4                                 | Group 1  |                                                            |
| 3  | 2018-02-13T01:25:00 | 3                                 | Group 1  |                                                            |
| 4  | 2018-02-13T01:31:00 | 6                                 | No Group | breaks group 1                                             |
| 5  | 2018-02-13T01:38:00 | 7                                 | No Group | not within interval on either side                         |
| 6  | 2018-02-13T01:44:00 | 6                                 | Group 2  | start of group 2                                           |
| 7  | 2018-02-13T01:47:00 | 3                                 | Group 2  |                                                            |
| 8  | 2018-02-13T01:48:00 | 1                                 | Group 2  |                                                            |
| 9  | 2018-02-13T01:49:00 | 1                                 | Group 2  |                                                            |
| 10 | 2018-02-13T01:51:00 | 2                                 | Group 2  |                                                            |
| 11 | 2018-02-13T01:57:00 | 6                                 | Group 3  | breaks group 2, included in next group as start of group 3 |
| 12 | 2018-02-13T01:59:00 | 2                                 | Group 3  |                                                            |
| 13 | 2018-02-13T02:01:00 | 2                                 | Group 3  |                                                            |
| 14 | 2018-02-13T02:02:00 | 1                                 | Group 3  |                                                            |
| 15 | 2018-02-13T02:08:00 | 6                                 | No Group | breaks group 3                                             |
| 16 | 2018-02-13T02:15:00 | 7                                 | No Group |                                                            |
| 17 | 2018-02-13T02:22:00 | 7                                 | No Group |                                                            |
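
For reference, here is a minimal sketch of how the separation column can be calculated with lag() against the time_series table defined below; minute_separation is just an illustrative alias, and it comes back NULL (shown as 0 above) for the first row since it has no predecessor:

-- Derive each row's separation (in minutes) from the previous row.
SELECT id,
       timestamp,
       EXTRACT(EPOCH FROM (timestamp - lag(timestamp) OVER (ORDER BY timestamp))) / 60
         AS minute_separation
FROM time_series
ORDER BY timestamp;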

I can find the difference between two consecutive rows and extract those pairs, but I'm not sure how I would extract a series of n+1 rows as one group.

My working sample is here: http://sqlfiddle.com/#!17/e9fa1/7

Sample Data SQL Insert:

CREATE TABLE time_series (
  id SERIAL UNIQUE, 
  name TEXT,
  timestamp TIMESTAMPTZ
);

INSERT INTO time_series (name, timestamp) VALUES
  ('Test', '2018-02-13T01:18:00'),
  ('Test', '2018-02-13T01:22:00'),
  ('Test', '2018-02-13T01:25:00'),
  ('Test', '2018-02-13T01:31:00'),
  ('Test', '2018-02-13T01:38:00'),
  ('Test', '2018-02-13T01:44:00'),
  ('Test', '2018-02-13T01:47:00'),
  ('Test', '2018-02-13T01:48:00'),
  ('Test', '2018-02-13T01:49:00'),
  ('Test', '2018-02-13T01:51:00'),
  ('Test', '2018-02-13T01:57:00'),
  ('Test', '2018-02-13T01:59:00'),
  ('Test', '2018-02-13T02:01:00'),
  ('Test', '2018-02-13T02:02:00'),
  ('Test', '2018-02-13T02:08:00'),
  ('Test', '2018-02-13T02:15:00'),
  ('Test', '2018-02-13T02:22:00');

Working Query:

WITH my_series AS
(
  SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp) AS rn
  FROM time_series
)
SELECT o1.id AS id1, o1.timestamp AS date1,
       o2.id AS id2, o2.timestamp AS date2,
       ABS(EXTRACT(EPOCH FROM (o1.timestamp - o2.timestamp))) AS diff
FROM my_series AS o1
JOIN my_series AS o2 ON o1.rn + 1 = o2.rn
WHERE ABS(EXTRACT(EPOCH FROM (o1.timestamp - o2.timestamp))) < 300;

I need to either return each group as one set (that could work, since I could repeat the query, noting the timestamp where the previous run ended and starting the next run from there, although that does not seem very efficient), or return all the rows labeled with a group number, like this:

Desired result:

| id | timestamp           | minute separation (Calculated **) | group |
|----|---------------------|-----------------------------------|-------|
| 1  | 2018-02-13T01:18:00 | 0                                 | 1     |
| 2  | 2018-02-13T01:22:00 | 4                                 | 1     |
| 3  | 2018-02-13T01:25:00 | 3                                 | 1     |
| 6  | 2018-02-13T01:44:00 | 6                                 | 2     |
| 7  | 2018-02-13T01:47:00 | 3                                 | 2     |
| 8  | 2018-02-13T01:48:00 | 1                                 | 2     |
| 9  | 2018-02-13T01:49:00 | 1                                 | 2     |
| 10 | 2018-02-13T01:51:00 | 2                                 | 2     |
| 11 | 2018-02-13T01:57:00 | 6                                 | 3     |
| 12 | 2018-02-13T01:59:00 | 2                                 | 3     |
| 13 | 2018-02-13T02:01:00 | 2                                 | 3     |
| 14 | 2018-02-13T02:02:00 | 1                                 | 3     |

Upvotes: 4

Views: 1117

Answers (2)

Aditya Bhattacharya

Reputation: 1014

I implemented something similar with monthly grouped data. Please check whether a similar solution can be applied in this case:

https://stackoverflow.com/a/59961217/7357308

I also have a blog post on the same approach: https://medium.com/@adib0073/quick-fix-to-importing-time-series-data-from-sql-server-93ad77f9b3d0

Please let me know if this helps.

Upvotes: -1

Gordon Linoff

Reputation: 1270463

You want to use lag() and lead(). I would enumerate all the groups, even the ones with only one row:

select s.*,
       sum( (timestamp >= prev_timestamp + interval '5 minute' or prev_timestamp is null)::int )
         over (order by timestamp) as grp
from (select s.*,
             lag(timestamp) over (order by timestamp) as prev_timestamp,
             lead(timestamp) over (order by timestamp) as next_timestamp
      from time_series s   -- the table posted in the question
     ) s;
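
If I'm reading the running sum correctly, against the sample data this labels every streak, singletons included: rows 1-3 become group 1, the isolated rows 4 and 5 become groups 2 and 3, rows 6-10 group 4, rows 11-14 group 5, and rows 15-17 groups 6-8. Each gap of 5 minutes or more (or the very first row) flips the flag to 1 and bumps the running sum, which is the standard gaps-and-islands technique.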

The logic is a bit more complicated for your actual question. I think this does the trick:

select s.*,
       (case when (prev_timestamp is null or timestamp > prev_timestamp + interval '5 minute') and
                  (next_timestamp is null or timestamp < next_timestamp - interval '5 minute')
             then null   -- isolated row: no neighbor within 5 minutes on either side
             else sum( ((prev_timestamp is null or timestamp >= prev_timestamp + interval '5 minute')
                        and next_timestamp < timestamp + interval '5 minute')::int )
                    over (order by timestamp)
        end) as grp
from (select s.*,
             lag(timestamp) over (order by timestamp) as prev_timestamp,
             lead(timestamp) over (order by timestamp) as next_timestamp
      from time_series s
     ) s;
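
An alternative sketch for the exact desired output: enumerate every streak with the first query, then keep only streaks with more than one row and renumber the survivors with dense_rank(). The CTE names and the minute_separation expression are my additions, and minute_separation shows NULL rather than 0 for the first row, since it has no predecessor:

-- Enumerate every streak (including singletons), drop streaks of one row,
-- then renumber the remaining streaks 1..n.
with labeled as (
  select s.*,
         sum( (prev_timestamp is null or timestamp >= prev_timestamp + interval '5 minute')::int )
           over (order by timestamp) as grp
  from (select s.*,
               lag(timestamp) over (order by timestamp) as prev_timestamp
        from time_series s
       ) s
),
sized as (
  select l.*,
         count(*) over (partition by grp) as grp_size  -- rows per streak
  from labeled l
)
select id,
       timestamp,
       round(extract(epoch from (timestamp - prev_timestamp)) / 60) as minute_separation,
       dense_rank() over (order by grp) as grp
from sized
where grp_size > 1   -- drop the isolated rows (streaks of one)
order by timestamp;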

Upvotes: 5
