Nathan Feger
Nathan Feger

Reputation: 19496

How to find peak and valley in running sequence in sql

So I've got a dataset in athena, so for the purposes of this you can probably treat it as a postgres database. The a sample of the data can be seen in this sql fiddle.

Here's a sample:

  create table vals (
  timestamp int,
  type varchar(25),
  val int
  );

  insert into vals(timestamp,type, val) 
  values      (10, null, 1),
              (20, null, 2),
              (39, null, 1),
              (40,'p',1),
              (50,'p',2),
              (60,'p',1),
              (70,'v',5),
              (80,'v',6),
              (90,'v',6),
              (100,'v',3),
              (110,null,3),
              (120,'v',6),
              (130,null,3),
              (140,'p',10),
              (150,'p',8),
              (160,null,3),
              (170,'p',1),
              (180,'p',2),
              (190,'p',2),
              (200,'p',1),
              (210,null,3),
              (220,'v',1),
              (230,'v',1),
              (240,'v',3),
              (250,'v',41)               

What I'd like to get is a dataset that includes all the values but highlights the highest value of the 'p' and the lowest value of the contiguous 'v's.

so ultimately I'd get:

   timestamp, type, value, is_peak
    (10, null, 1, null),
    (20, null, 2, null),
    (39, null, 1, null),
    (40,'p',1, null),
    (50,'p',2, 1),
    (60,'p',1, null),
    (70,'v',5, null),
    (80,'v',6, null),
    (90,'v',6, null),
    (100,'v',3, 1),
    (110,null,3, null),
    (120,'v',6, 1),
    (130,null,3, null),
    (140,'p',10, 1),
    (150,'p',8, null),
    (160,null,3, null),
    (170,'p',1, null),
    (180,'p',2, 1),
    (190,'p',2, null), -- either this record or 180 would be fine
    (200,'p',1, null),
    (210,null,3, null),
    (220,'v',1, 1), -- again either this or 230
    (230,'v',1, null),
    (240,'v',3, null),
    (250,'v',41, null) 

is peak has a lot of choices for the type, if it were some sort of denserank or incrementing number that'd be fine. Just so I can be confident that within the contiguous set the 'marked' one is the highest or lowest value.

Good luck appreciate the assist

note: The peak of the max of the peaks or min of the valley's can be anwhere in the contiguous set, but as soon as the type changes we start over.

Upvotes: 3

Views: 2183

Answers (3)

LukStorms
LukStorms

Reputation: 29667

There's a little trick that can be used for a gaps-and-islands problem like this one.

By subtracting a row_number from a row_number over a value you get some ranking.

There's some disadvantages with this method for certain purposes.
But it works for this case.

Once that ranking is calculated then it can be used by the other window functions in the outer query.
And we can again use row_number for this. But depending on the requirement, you could use DENSE_RANK or the window function of MIN & MAX instead.

Then we just wrap those in a CASE for the different logic depending on the type.

select timestamp, type, val, 
(case 
 when type = 'v' and row_number() over (partition by (rn1-rn2), type order by val, rn1) = 1 then 1
 when type = 'p' and row_number() over (partition by (rn1-rn2), type order by val desc, rn1) = 1 then 1
 end) is_peak
-- , rn1, rn2, (rn1-rn2) as rnk
from
(
  select timestamp, type, val,
   row_number() over (order by timestamp) as rn1,
   row_number() over (partition by type order by timestamp) as rn2
  from vals
) q
order by timestamp;

You can test a SQL Fiddle here

Returns:

timestamp   type    val     is_peak
---------   ----    ----    -------
10          null    1       null
20          null    2       null
39          null    1       null
40          p       1       null
50          p       2       1
60          p       1       null
70          v       5       null
80          v       6       null
90          v       6       null
100         v       3       1
110         null    3       null
120         v       6       1
130         null    3       null
140         p       10      1
150         p       8       null
160         null    3       null
170         p       1       null
180         p       2       1
190         p       2       null
200         p       1       null
210         null    3       null
220         v       1       1
230         v       1       null
240         v       3       null
250         v       41      null

Upvotes: 3

Lukasz Szozda
Lukasz Szozda

Reputation: 175746

You could use LEAD/LAG window functions:

SELECT *,
  CASE WHEN type = 'p' AND val>LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
        AND val > LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
       WHEN type = 'v' AND val<LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
       AND val < LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
  END AS is_peak
FROM vals
ORDER BY timestamp;

db<>fiddle demo

Output:

┌───────────┬───────┬──────┬─────────┐
│ timestamp │ type  │ val  │ is_peak │
├───────────┼───────┼──────┼─────────┤
│       10  │       │   1  │         │
│       20  │       │   2  │         │
│       39  │       │   1  │         │
│       40  │ p     │   1  │         │
│       50  │ p     │   2  │       1 │
│       60  │ p     │   1  │         │
│       70  │ v     │   5  │         │
│       80  │ v     │   6  │         │
│       90  │ v     │   6  │         │
│      100  │ v     │   3  │       1 │
│      110  │       │   3  │         │
│      120  │ v     │   6  │         │
│      130  │       │   3  │         │
│      140  │ p     │  10  │       1 │
│      150  │ p     │   8  │         │
└───────────┴───────┴──────┴─────────┘

Version with window clause:

SELECT *, CASE WHEN type = 'p' AND val > LAG(val) OVER s
                AND val > LEAD(val) OVER s THEN 1 
               WHEN type = 'v' AND val < LAG(val) OVER s
                AND val < LEAD(val) OVER s THEN 1 
          END AS is_peak
FROM vals
WINDOW s AS (PARTITION BY type ORDER BY timestamp)
ORDER BY timestamp;

db<>fiddle demo2

EDIT

I think with a hopefully small change we can get timestamp 120 also, then that'll be it

SELECT *,CASE
  WHEN type IN ('p','v') AND val > LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
  AND val > LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
  WHEN type IN ('v') AND val < LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
  AND val < LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
 END AS is_peak
FROM vals
ORDER BY timestamp;

db<>fiddle demo3


EDIT 2:

Final solution with gaps-and-islands detection(handling plateau):

WITH cte AS (
  SELECT *, LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) AS l
  FROM vals
), cte2 AS (
  SELECT *, SUM(CASE WHEN val = l THEN 1 ELSE 0 END) OVER(PARTITION BY type ORDER BY timestamp) AS dr
  FROM cte
), cte3 AS (
  SELECT *, CASE WHEN type IN ('p') AND val > LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
                AND val >= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
               WHEN type IN ('v') AND val < LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
                AND val <= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1 
          END AS is_peak
  FROM cte2
)
SELECT timestamp, type, val,
     CASE WHEN is_peak = 1 THEN 1 
          WHEN EXISTS (SELECT 1 FROM cte3 cx
                       WHERE cx.is_peak = 1
                         AND cx.val = cte3.val
                         AND cx.type = cte3.type
                         AND cx.dr = cte3.dr)
              THEN 1
     END is_peak
FROM cte3
ORDER BY timestamp;

db<>fiddle demo final

Output:

┌────────────┬───────┬──────┬─────────┐
│ timestamp  │ type  │ val  │ is_peak │
├────────────┼───────┼──────┼─────────┤
│        10  │       │   1  │         │
│        20  │       │   2  │         │
│        39  │       │   1  │         │
│        40  │ p     │   1  │         │
│        50  │ p     │   2  │       1 │
│        60  │ p     │   1  │         │
│        70  │ v     │   5  │         │
│        80  │ v     │   6  │         │
│        90  │ v     │   6  │         │
│       100  │ v     │   3  │       1 │
│       110  │       │   3  │         │
│       120  │ v     │   6  │         │
│       130  │       │   3  │         │
│       140  │ p     │  10  │       1 │
│       150  │ p     │   8  │         │
│       160  │       │   3  │         │
│       170  │ p     │   1  │         │
│       180  │ p     │   2  │       1 │
│       190  │ p     │   2  │       1 │
│       200  │ p     │   1  │         │
│       210  │       │   3  │         │
│       220  │ v     │   1  │       1 │
│       230  │ v     │   1  │       1 │
│       240  │ v     │   3  │         │
│       250  │ v     │  41  │         │
└────────────┴───────┴──────┴─────────┘

Additional note:

ISO SQL:2016 adds pattern matching MATCH_RECOGNIZE for this kind of scenarios where you define regexp for peak like PATTERN (STRT UP+ FLAT* DOWN+) but currently it is supported only by Oracle.

Related article: Modern SQL - match_recognize Regular Expressions Over Rows

Upvotes: 3

3N1GM4
3N1GM4

Reputation: 3351

You could use subqueries within a case statement to achieve this:

create table #vals 
(
    [timestamp] int,
    [type] varchar(25),
    val int
);

insert into #vals ([timestamp], [type], val) 
values  (10, null, 1),
        (20, null, 2),
        (30, null, 1),
        (40,'p',1),
        (50,'p',2),
        (60,'p',1),
        (70,'v',5),
        (80,'v',6),
        (90,'v',6),
        (100,'v',3),
        (110,null,3)

select 
    r.*,
    case 
        when r.[type] = 'p' and not exists (select * from #vals c where c.[type] = r.[type] and c.val > r.val) then 1
        when r.[type] = 'v' and not exists (select * from #vals c where c.[type] = r.[type] and c.val < r.val) then 1
        else null
    end as is_peak
from #vals r

drop table #vals

Results:

/----------------------------------\
| timestamp | type | val | is_peak |
|-----------|------|-----|---------|
| 10        | NULL | 1   | NULL    |
| 20        | NULL | 2   | NULL    |
| 30        | NULL | 1   | NULL    |
| 40        | p    | 1   | NULL    |
| 50        | p    | 2   | 1       |
| 60        | p    | 1   | NULL    |
| 70        | v    | 5   | NULL    |
| 80        | v    | 6   | NULL    |
| 90        | v    | 6   | NULL    |
| 100       | v    | 3   | 1       |
| 110       | NULL | 3   | NULL    |
\----------------------------------/

Note: If there are multiple records with the same (peak) val, they would each be marked with a 1 in the is_peak column.

Upvotes: 1

Related Questions