Reputation: 1144
I want to get the first element of each group, but the groups must be calculated within each window. I would like to do something like this:
SCHEMA

| TABLE |
| --- |
| id |
| target |
| groups |
| capture_date |
| event_date |
SELECT
AVG(
FIRST(target) GROUP BY id ORDER BY capture_date DESC WHERE capture_date <= MAX(event_date)
) OVER (
PARTITION BY groups
ORDER BY event_date
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
)
FROM table
I want to do this in SQL or PySpark, whichever is simpler. Any ideas? Thank you!
Upvotes: 0
Views: 57
Reputation: 5125
Here's an all-SQL version, but it can be rewritten in Spark/PySpark if needed. I used GROUP BY, but you could also run a second window with row_number and a WHERE clause (see the sketch after the query below).
with raw_averages as ( -- shortcut for a subquery
  SELECT
    AVG( target ) OVER (
      PARTITION BY groups
      ORDER BY event_date
      RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
    ) as average,
    groups, -- needed below for the GROUP BY and the final select
    id,
    capture_date,
    event_date
  FROM table ),
grouped_result as -- another shortcut for a subquery
( SELECT
    groups,
    id,
    avg(average) as average, -- the average of the entire group is the same, so avg() just collapses it --> math trick
    reverse( -- sort descending
      array_sort( -- sort ascending by the first item ( event_date )
        arrays_zip( -- combine the arrays below into one array of structs
          collect_list( event_date ), -- collect the grouped items *has to be first to get the ordering you want*
          collect_list( capture_date ) -- collect the grouped items
        )
      )
    )[0] as values -- the first element after the descending sort is the max event_date
  from raw_averages
  GROUP BY groups, id
  -- HAVING values.`0` = values.`1` -- HAVING might work here but I didn't explore it
)
select
  groups,
  id,
  average,
  values.`0` as event_date, -- awkward syntax because arrays_zip names the zipped fields 0, 1, ...
  values.`1` as capture_date
from
  grouped_result
where values.`0` = values.`1`
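
For reference, here's a minimal sketch of the row_number variant mentioned at the top, assuming the same raw_averages CTE as above; it picks the latest event_date per (groups, id) with a second window instead of the collect_list/arrays_zip gymnastics:

with raw_averages as (
  SELECT
    AVG( target ) OVER (
      PARTITION BY groups
      ORDER BY event_date
      RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
    ) as average,
    groups,
    id,
    capture_date,
    event_date
  FROM table ),
ranked as (
  SELECT
    *,
    row_number() OVER (
      PARTITION BY groups, id
      ORDER BY event_date DESC -- latest event first
    ) as rn
  FROM raw_averages )
select
  groups,
  id,
  average,
  event_date,
  capture_date
from ranked
where rn = 1 -- keep only the most recent row per (groups, id)

If you also need the event_date = capture_date check from the version above, add `and capture_date = event_date` to the final where.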
Upvotes: 1