Reputation: 19496
So I've got a dataset in athena, so for the purposes of this you can probably treat it as a postgres database. The a sample of the data can be seen in this sql fiddle.
Here's a sample:
create table vals (
timestamp int,
type varchar(25),
val int
);
insert into vals(timestamp,type, val)
values (10, null, 1),
(20, null, 2),
(39, null, 1),
(40,'p',1),
(50,'p',2),
(60,'p',1),
(70,'v',5),
(80,'v',6),
(90,'v',6),
(100,'v',3),
(110,null,3),
(120,'v',6),
(130,null,3),
(140,'p',10),
(150,'p',8),
(160,null,3),
(170,'p',1),
(180,'p',2),
(190,'p',2),
(200,'p',1),
(210,null,3),
(220,'v',1),
(230,'v',1),
(240,'v',3),
(250,'v',41)
What I'd like to get is a dataset that includes all the values but highlights the highest value of the 'p' and the lowest value of the contiguous 'v's.
so ultimately I'd get:
timestamp, type, value, is_peak
(10, null, 1, null),
(20, null, 2, null),
(39, null, 1, null),
(40,'p',1, null),
(50,'p',2, 1),
(60,'p',1, null),
(70,'v',5, null),
(80,'v',6, null),
(90,'v',6, null),
(100,'v',3, 1),
(110,null,3, null),
(120,'v',6, 1),
(130,null,3, null),
(140,'p',10, 1),
(150,'p',8, null),
(160,null,3, null),
(170,'p',1, null),
(180,'p',2, 1),
(190,'p',2, null), -- either this record or 180 would be fine
(200,'p',1, null),
(210,null,3, null),
(220,'v',1, 1), -- again either this or 230
(230,'v',1, null),
(240,'v',3, null),
(250,'v',41, null)
is peak has a lot of choices for the type, if it were some sort of denserank or incrementing number that'd be fine. Just so I can be confident that within the contiguous set the 'marked' one is the highest or lowest value.
Good luck appreciate the assist
note: The peak of the max of the peaks or min of the valley's can be anwhere in the contiguous set, but as soon as the type changes we start over.
Upvotes: 3
Views: 2183
Reputation: 29667
There's a little trick that can be used for a gaps-and-islands problem like this one.
By subtracting a row_number from a row_number over a value you get some ranking.
There's some disadvantages with this method for certain purposes.
But it works for this case.
Once that ranking is calculated then it can be used by the other window functions in the outer query.
And we can again use row_number for this.
But depending on the requirement, you could use DENSE_RANK or the window function of MIN & MAX instead.
Then we just wrap those in a CASE
for the different logic depending on the type.
select timestamp, type, val,
(case
when type = 'v' and row_number() over (partition by (rn1-rn2), type order by val, rn1) = 1 then 1
when type = 'p' and row_number() over (partition by (rn1-rn2), type order by val desc, rn1) = 1 then 1
end) is_peak
-- , rn1, rn2, (rn1-rn2) as rnk
from
(
select timestamp, type, val,
row_number() over (order by timestamp) as rn1,
row_number() over (partition by type order by timestamp) as rn2
from vals
) q
order by timestamp;
You can test a SQL Fiddle here
Returns:
timestamp type val is_peak
--------- ---- ---- -------
10 null 1 null
20 null 2 null
39 null 1 null
40 p 1 null
50 p 2 1
60 p 1 null
70 v 5 null
80 v 6 null
90 v 6 null
100 v 3 1
110 null 3 null
120 v 6 1
130 null 3 null
140 p 10 1
150 p 8 null
160 null 3 null
170 p 1 null
180 p 2 1
190 p 2 null
200 p 1 null
210 null 3 null
220 v 1 1
230 v 1 null
240 v 3 null
250 v 41 null
Upvotes: 3
Reputation: 175746
You could use LEAD/LAG window functions:
SELECT *,
CASE WHEN type = 'p' AND val>LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
AND val > LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
WHEN type = 'v' AND val<LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
AND val < LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
END AS is_peak
FROM vals
ORDER BY timestamp;
Output:
┌───────────┬───────┬──────┬─────────┐
│ timestamp │ type │ val │ is_peak │
├───────────┼───────┼──────┼─────────┤
│ 10 │ │ 1 │ │
│ 20 │ │ 2 │ │
│ 39 │ │ 1 │ │
│ 40 │ p │ 1 │ │
│ 50 │ p │ 2 │ 1 │
│ 60 │ p │ 1 │ │
│ 70 │ v │ 5 │ │
│ 80 │ v │ 6 │ │
│ 90 │ v │ 6 │ │
│ 100 │ v │ 3 │ 1 │
│ 110 │ │ 3 │ │
│ 120 │ v │ 6 │ │
│ 130 │ │ 3 │ │
│ 140 │ p │ 10 │ 1 │
│ 150 │ p │ 8 │ │
└───────────┴───────┴──────┴─────────┘
Version with window clause:
SELECT *, CASE WHEN type = 'p' AND val > LAG(val) OVER s
AND val > LEAD(val) OVER s THEN 1
WHEN type = 'v' AND val < LAG(val) OVER s
AND val < LEAD(val) OVER s THEN 1
END AS is_peak
FROM vals
WINDOW s AS (PARTITION BY type ORDER BY timestamp)
ORDER BY timestamp;
EDIT
I think with a hopefully small change we can get timestamp 120 also, then that'll be it
SELECT *,CASE
WHEN type IN ('p','v') AND val > LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
AND val > LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
WHEN type IN ('v') AND val < LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
AND val < LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
END AS is_peak
FROM vals
ORDER BY timestamp;
EDIT 2:
Final solution with gaps-and-islands
detection(handling plateau):
WITH cte AS (
SELECT *, LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) AS l
FROM vals
), cte2 AS (
SELECT *, SUM(CASE WHEN val = l THEN 1 ELSE 0 END) OVER(PARTITION BY type ORDER BY timestamp) AS dr
FROM cte
), cte3 AS (
SELECT *, CASE WHEN type IN ('p') AND val > LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
AND val >= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
WHEN type IN ('v') AND val < LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
AND val <= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
END AS is_peak
FROM cte2
)
SELECT timestamp, type, val,
CASE WHEN is_peak = 1 THEN 1
WHEN EXISTS (SELECT 1 FROM cte3 cx
WHERE cx.is_peak = 1
AND cx.val = cte3.val
AND cx.type = cte3.type
AND cx.dr = cte3.dr)
THEN 1
END is_peak
FROM cte3
ORDER BY timestamp;
Output:
┌────────────┬───────┬──────┬─────────┐
│ timestamp │ type │ val │ is_peak │
├────────────┼───────┼──────┼─────────┤
│ 10 │ │ 1 │ │
│ 20 │ │ 2 │ │
│ 39 │ │ 1 │ │
│ 40 │ p │ 1 │ │
│ 50 │ p │ 2 │ 1 │
│ 60 │ p │ 1 │ │
│ 70 │ v │ 5 │ │
│ 80 │ v │ 6 │ │
│ 90 │ v │ 6 │ │
│ 100 │ v │ 3 │ 1 │
│ 110 │ │ 3 │ │
│ 120 │ v │ 6 │ │
│ 130 │ │ 3 │ │
│ 140 │ p │ 10 │ 1 │
│ 150 │ p │ 8 │ │
│ 160 │ │ 3 │ │
│ 170 │ p │ 1 │ │
│ 180 │ p │ 2 │ 1 │
│ 190 │ p │ 2 │ 1 │
│ 200 │ p │ 1 │ │
│ 210 │ │ 3 │ │
│ 220 │ v │ 1 │ 1 │
│ 230 │ v │ 1 │ 1 │
│ 240 │ v │ 3 │ │
│ 250 │ v │ 41 │ │
└────────────┴───────┴──────┴─────────┘
Additional note:
ISO SQL:2016 adds pattern matching MATCH_RECOGNIZE for this kind of scenarios where you define regexp for peak like PATTERN (STRT UP+ FLAT* DOWN+)
but currently it is supported only by Oracle.
Related article: Modern SQL - match_recognize Regular Expressions Over Rows
Upvotes: 3
Reputation: 3351
You could use subqueries within a case
statement to achieve this:
create table #vals
(
[timestamp] int,
[type] varchar(25),
val int
);
insert into #vals ([timestamp], [type], val)
values (10, null, 1),
(20, null, 2),
(30, null, 1),
(40,'p',1),
(50,'p',2),
(60,'p',1),
(70,'v',5),
(80,'v',6),
(90,'v',6),
(100,'v',3),
(110,null,3)
select
r.*,
case
when r.[type] = 'p' and not exists (select * from #vals c where c.[type] = r.[type] and c.val > r.val) then 1
when r.[type] = 'v' and not exists (select * from #vals c where c.[type] = r.[type] and c.val < r.val) then 1
else null
end as is_peak
from #vals r
drop table #vals
Results:
/----------------------------------\
| timestamp | type | val | is_peak |
|-----------|------|-----|---------|
| 10 | NULL | 1 | NULL |
| 20 | NULL | 2 | NULL |
| 30 | NULL | 1 | NULL |
| 40 | p | 1 | NULL |
| 50 | p | 2 | 1 |
| 60 | p | 1 | NULL |
| 70 | v | 5 | NULL |
| 80 | v | 6 | NULL |
| 90 | v | 6 | NULL |
| 100 | v | 3 | 1 |
| 110 | NULL | 3 | NULL |
\----------------------------------/
Note: If there are multiple records with the same (peak) val
, they would each be marked with a 1
in the is_peak
column.
Upvotes: 1