Reputation: 671
Using Amazon Redshift (SQL), I've got a table of timestamps that I'd like to split into separate phases when the time between entries is above some threshold.
For example, using a threshold of 60 units for this input:
id ts
1 a 1
2 a 4
3 a 12
4 a 90
5 a 94
6 a 101
7 a 404
8 a 412
9 a 413
I'd like to return this:
id ts dt phase
1 a 1 NA 1
2 a 4 3 1
3 a 12 8 1
4 a 90 78 2
5 a 94 4 2
6 a 101 7 2
7 a 404 303 3
8 a 412 8 3
9 a 413 1 3
This is straightforward in R (which I'm most familiar with) using a simple for
loop and ifelse
which increments the previous phase
value by 1 if dt
> 60:
# sample data
df <- data.frame(id = rep("a", 9),
ts = c(1, 4, 12, 90, 94, 101, 404, 412, 413)) %>%
mutate(dt = c(NA, diff(ts)))
# add default minimum phase value, 1
df$phase<- 1
# for loop
for(i in 2:nrow(df)) {
df$phase[i] <- ifelse(df$dt[i] > 60, df$phase[i-1] + 1, df$phase[i-1])
}
However, my attempts using the lag
function and case
/ when
in SQL have not been successful.
-- sample data
CREATE TABLE sampledata (
conversationid varchar(10), ts integer
);
INSERT INTO sampledata (conversationid, ts)
VALUES
('a', 1),
('a', 4),
('a', 12),
('a', 90),
('a', 94),
('a', 101),
('a', 404),
('a', 412),
('a', 413);
-- query
SELECT *,
CASE
WHEN dt > 60 THEN LAG(period) OVER (PARTITION BY conversationid ORDER BY ts) + 1
ELSE LAG(period) OVER (PARTITION BY conversationid ORDER BY ts)
END AS period
FROM (
SELECT *,
ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts) AS dt,
1 AS period
FROM sampledata
)
ORDER BY ts
;
-- output
id ts dt period period
a 1 1
a 4 3 1 1
a 12 8 1 1
a 90 78 1 2
a 94 4 1 1
a 101 7 1 1
a 404 303 1 2
a 412 8 1 1
a 413 1 1 1
I'm able to increment the phase value on rows where dt
> 60, but not propagate the incremented phase
value across subsequent rows.
I guess this may be something to do with the lag
function operating across all rows at once rather than row-by-row and/or being unable to update the original phase
value on the fly (a second column phase
is instead created).
Upvotes: 0
Views: 2076
Reputation: 1269463
You are close. You want a cumulative sum based on the lag difference:
SELECT sd.*,
SUM(CASE WHEN diff > 60 THEN 1 ELSE 0 END) OVER (PARTITION BY conversationid ORDER BY ts) as period
FROM (SELECT sd.*,
(ts - LAG(ts) OVER (PARTITION BY conversationid ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ) AS diff
FROM sampledata sd
) sd
ORDER BY ts;
As a side note, I would expect you to use ORDER BY conversationid, ts
, rather than just the time.
And finally, the above will start the periods at NULL
(it should correctly identify them, just naming them awkwardly). The following tweak does the enumeration as you specifically request:
SELECT sd.*,
(1 + SUM(CASE WHEN diff < 60 THEN 0 ELSE 1 END) OVER (PARTITION BY conversationid ORDER BY ts ROW BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) as period
Upvotes: 1