jogall
jogall

Reputation: 671

Conditionally increment previous value and propagate forward using Amazon Redshift (SQL)

Using Amazon Redshift (SQL), I've got a table of timestamps that I'd like to split into separate phases when the time between entries is above some threshold.

For example, using a threshold of 60 units for this input:

  id  ts
1  a   1
2  a   4
3  a  12
4  a  90
5  a  94
6  a 101
7  a 404
8  a 412
9  a 413

I'd like to return this:

  id  ts  dt phase
1  a   1  NA     1
2  a   4   3     1
3  a  12   8     1
4  a  90  78     2
5  a  94   4     2
6  a 101   7     2
7  a 404 303     3
8  a 412   8     3
9  a 413   1     3

This is straightforward in R (which I'm most familiar with) using a simple for loop and ifelse which increments the previous phase value by 1 if dt > 60:

# sample data
df <- data.frame(id = rep("a", 9),
                 ts = c(1, 4, 12, 90, 94, 101, 404, 412, 413)) %>%
  mutate(dt = c(NA, diff(ts)))

# add default minimum phase value, 1
df$phase<- 1
# for loop
for(i in 2:nrow(df)) {
  df$phase[i] <- ifelse(df$dt[i] > 60, df$phase[i-1] + 1, df$phase[i-1])
}

However, my attempts using the lag function and case / when in SQL have not been successful.

-- sample data
CREATE TABLE sampledata (
  conversationid varchar(10), ts integer
);

INSERT INTO sampledata (conversationid, ts)
VALUES
  ('a', 1),
  ('a', 4),
  ('a', 12),
  ('a', 90),
  ('a', 94),
  ('a', 101),
  ('a', 404),
  ('a', 412),
  ('a', 413);

-- query
SELECT *,
  CASE
    WHEN dt > 60 THEN LAG(period) OVER (PARTITION BY conversationid ORDER BY ts) + 1
    ELSE LAG(period) OVER (PARTITION BY conversationid ORDER BY ts)
  END AS period
FROM (
  SELECT *,
    ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts) AS dt,
    1 AS period
  FROM sampledata
)
ORDER BY ts
;

-- output
id ts   dt  period period
a  1        1   
a  4    3   1      1
a  12   8   1      1
a  90   78  1      2
a  94   4   1      1
a  101  7   1      1
a  404  303 1      2
a  412  8   1      1
a  413  1   1      1

I'm able to increment the phase value on rows where dt > 60, but not propagate the incremented phase value across subsequent rows.

I guess this may be something to do with the lag function operating across all rows at once rather than row-by-row and/or being unable to update the original phase value on the fly (a second column phase is instead created).

Upvotes: 0

Views: 2076

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269463

You are close. You want a cumulative sum based on the lag difference:

SELECT sd.*,
       SUM(CASE WHEN diff > 60 THEN 1 ELSE 0 END) OVER (PARTITION BY conversationid ORDER BY ts) as period
FROM (SELECT sd.*,
             (ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ) AS diff
      FROM sampledata sd
     ) sd
ORDER BY ts;

As a side note, I would expect you to use ORDER BY conversationid, ts, rather than just the time.

And finally, the above will start the periods at NULL (it should correctly identify them, just naming them awkwardly). The following tweak does the enumeration as you specifically request:

SELECT sd.*,
       (1 + SUM(CASE WHEN diff < 60 THEN 0 ELSE 1 END) OVER (PARTITION BY conversationid ORDER BY ts ROW BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) as period

Upvotes: 1

Related Questions