ademait

Reputation: 43

Group by bursts of occurrences in TimescaleDB/PostgreSQL

This is my first question on Stack Overflow, so any advice on how to ask a well-structured question is welcome.

I have a TimescaleDB database, which is a time-series database built on top of PostgreSQL. It keeps most of PostgreSQL's functionality, so not knowing Timescale shouldn't be an issue. I have a SELECT statement which returns:

          time          | num_issues |   actor_login
------------------------+------------+------------------
 2015-11-10 01:00:00+01 |          2 | nifl
 2015-12-10 01:00:00+01 |          1 | anandtrex
 2016-01-09 01:00:00+01 |          1 | isaacrg
 2016-02-08 01:00:00+01 |          1 | timbarclay
 2016-06-07 02:00:00+02 |          1 | kcalmes
 2016-07-07 02:00:00+02 |          1 | cassiozen
 2016-08-06 02:00:00+02 |         13 | phae
 2016-09-05 02:00:00+02 |          2 | phae
 2016-10-05 02:00:00+02 |         13 | cassiozen
 2016-11-04 01:00:00+01 |          6 | cassiozen
 2016-12-04 01:00:00+01 |          4 | cassiozen
 2017-01-03 01:00:00+01 |          5 | cassiozen
 2017-02-02 01:00:00+01 |          8 | cassandraoid
 2017-03-04 01:00:00+01 |         16 | erquhart
 2017-04-03 02:00:00+02 |          3 | erquhart
 2017-05-03 02:00:00+02 |          9 | erquhart
 2017-06-02 02:00:00+02 |          5 | erquhart
 2017-07-02 02:00:00+02 |          2 | greatwarlive
 2017-08-01 02:00:00+02 |          8 | tech4him1
 2017-08-31 02:00:00+02 |          7 | tech4him1
 2017-09-30 02:00:00+02 |         17 | erquhart
 2017-10-30 01:00:00+01 |          7 | erquhart
 2017-11-29 01:00:00+01 |         12 | erquhart
 2017-12-29 01:00:00+01 |          8 | tech4him1
 2018-01-28 01:00:00+01 |          6 | ragasirtahk

And so on. Basically, for each time bucket (30 days in this case) it returns the user who opened the most issues. The SQL query is:

SELECT DISTINCT ON (time_bucket('30 days', created_at))
  time_bucket('30 days', created_at) AS time,
  count(id) AS num_issues,
  actor_login
FROM
  issues_event
WHERE action = 'opened' AND repo_name = 'netlify/netlify-cms'
GROUP BY time, actor_login
ORDER BY time, num_issues DESC

My question is: how can I detect or group rows that have the same actor_login and are consecutive? For example, I would like to group the cassiozen rows from 2016-10-05 to 2017-01-03, but not together with the other cassiozen row in the column. I have tried auxiliary columns and window functions such as LAG, but I don't think it is possible without a function or a DO statement. I also tried writing functions, but I couldn't find a way.

Any approach, idea or solution will be fully appreciated.

Edit: here is my desired output.

          time          | num_issues |   actor_login    | actor_group_id
------------------------+------------+------------------+----------------
 2015-11-10 01:00:00+01 |          2 | nifl             |              0
 2015-12-10 01:00:00+01 |          1 | anandtrex        |              1
 2016-01-09 01:00:00+01 |          1 | isaacrg          |              2
 2016-02-08 01:00:00+01 |          1 | timbarclay       |              3
 2016-06-07 02:00:00+02 |          1 | kcalmes          |              4
 2016-07-07 02:00:00+02 |          1 | cassiozen        |              5
 2016-08-06 02:00:00+02 |         13 | phae             |              6
 2016-09-05 02:00:00+02 |          2 | phae             |              6
 2016-10-05 02:00:00+02 |         13 | cassiozen        |              7
 2016-11-04 01:00:00+01 |          6 | cassiozen        |              7
 2016-12-04 01:00:00+01 |          4 | cassiozen        |              7
 2017-01-03 01:00:00+01 |          5 | cassiozen        |              7
 2017-02-02 01:00:00+01 |          8 | cassandraoid     |             12
 2017-03-04 01:00:00+01 |         16 | erquhart         |             13
 2017-04-03 02:00:00+02 |          3 | erquhart         |             13
 2017-05-03 02:00:00+02 |          9 | erquhart         |             13
 2017-06-02 02:00:00+02 |          5 | erquhart         |             13
 2017-07-02 02:00:00+02 |          2 | greatwarlive     |             17
 2017-08-01 02:00:00+02 |          8 | tech4him1        |             18
 2017-08-31 02:00:00+02 |          7 | tech4him1        |             18
 2017-09-30 02:00:00+02 |         17 | erquhart         |             16
 2017-10-30 01:00:00+01 |          7 | erquhart         |             16
 2017-11-29 01:00:00+01 |         12 | erquhart         |             16
 2017-12-29 01:00:00+01 |          8 | tech4him1        |             21
 2018-01-28 01:00:00+01 |          6 | ragasirtahk      |             24

MatBailie's solution is almost perfect. I just wanted to group consecutive actors like this so that I could extract a set of metrics from other attributes of the table.

Upvotes: 3

Views: 188

Answers (1)

MatBailie

Reputation: 86716

You could use a so-called "gaps-and-islands" approach:

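WITH
  sorted AS
(
  SELECT
    *,
    -- rn: position of the row in the whole result, ordered by time
    ROW_NUMBER() OVER (                         ORDER BY time)   AS rn,
    -- rn_actor: position of the row among that actor's own rows
    ROW_NUMBER() OVER (PARTITION BY actor_login ORDER BY time)   AS rn_actor
  FROM
    your_results
)
SELECT
  *,
  -- the difference stays constant while the same actor repeats without interruption
  rn - rn_actor  AS actor_group_id
FROM
  sorted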

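Within a run of consecutive rows for the same actor, rn and rn_actor both increase by 1 per row, so their difference stays constant. When another actor interrupts the run, rn advances but rn_actor does not, so the difference changes. The actor_group_id value on its own is not guaranteed to be unique across actors, but the combination of (actor_login, actor_group_id) will group consecutive rows together.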
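Since the goal is to extract metrics per burst, here is a minimal sketch of aggregating over (actor_login, actor_group_id). The your_results CTE is only a stand-in for the question's 30-day bucket query, stubbed with a few rows from the sample output; the names burst_start, burst_end, num_buckets and total_issues are made up for illustration.

```sql
-- Sketch only: your_results is stubbed with sample rows from the question.
WITH
  your_results (time, num_issues, actor_login) AS
(
  VALUES
    ('2016-07-07 02:00:00+02'::timestamptz,  1, 'cassiozen'),
    ('2016-08-06 02:00:00+02'::timestamptz, 13, 'phae'),
    ('2016-09-05 02:00:00+02'::timestamptz,  2, 'phae'),
    ('2016-10-05 02:00:00+02'::timestamptz, 13, 'cassiozen'),
    ('2016-11-04 01:00:00+01'::timestamptz,  6, 'cassiozen')
),
  sorted AS
(
  SELECT
    *,
    ROW_NUMBER() OVER (                         ORDER BY time)   AS rn,
    ROW_NUMBER() OVER (PARTITION BY actor_login ORDER BY time)   AS rn_actor
  FROM
    your_results
)
SELECT
  actor_login,
  rn - rn_actor    AS actor_group_id,
  MIN(time)        AS burst_start,   -- first bucket of the burst
  MAX(time)        AS burst_end,     -- last bucket of the burst
  COUNT(*)         AS num_buckets,   -- consecutive buckets in the burst
  SUM(num_issues)  AS total_issues   -- issues opened during the burst
FROM
  sorted
GROUP BY
  actor_login,
  rn - rn_actor
ORDER BY
  burst_start
```

With these sample rows, the single cassiozen row from 2016-07-07 and the later cassiozen run end up in separate groups, and the two phae rows are collapsed into one.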

db<>fiddle demo
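If sequential group numbers are preferred, a LAG-based variant is also possible without a function or a DO statement. This is a sketch rather than a tested drop-in: your_results is again a stub for the question's query, is_new_burst and burst_no are hypothetical names, and it assumes each time value appears only once (which holds for the DISTINCT ON bucket query).

```sql
-- Sketch only: your_results is stubbed with sample rows from the question.
WITH
  your_results (time, num_issues, actor_login) AS
(
  VALUES
    ('2016-07-07 02:00:00+02'::timestamptz,  1, 'cassiozen'),
    ('2016-08-06 02:00:00+02'::timestamptz, 13, 'phae'),
    ('2016-09-05 02:00:00+02'::timestamptz,  2, 'phae'),
    ('2016-10-05 02:00:00+02'::timestamptz, 13, 'cassiozen'),
    ('2016-11-04 01:00:00+01'::timestamptz,  6, 'cassiozen')
),
  flagged AS
(
  SELECT
    *,
    -- 1 when the actor differs from the previous row (or there is no previous row)
    CASE
      WHEN actor_login = LAG(actor_login) OVER (ORDER BY time) THEN 0
      ELSE 1
    END AS is_new_burst
  FROM
    your_results
)
SELECT
  time,
  num_issues,
  actor_login,
  -- running total of burst starts gives each burst its own number
  SUM(is_new_burst) OVER (ORDER BY time)  AS burst_no
FROM
  flagged
ORDER BY
  time
```

Here burst_no alone identifies a burst, so it can be used directly in a GROUP BY.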

Upvotes: 2
