R3uK
R3uK

Reputation: 14537

Using nested window function in Snowflake

I've seen a lot of questions about this general error, but I don't get why I have it, maybe because of nested window functions...


With the below query, I get the error for Col_C, Col_D, ... and almost everything I tried

SQL compilation error: [eachColumn] is not a valid group by expression

SELECT
    Col_A,
    Col_B,
    FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    MAX(Col_D)                      OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    FIRST_VALUE(CASE WHEN Col_T = 'testvalue'
                THEN LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                                    ORDER BY Col_TimeStamp DESC 
                                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
                ELSE NULL END) IGNORE NULLS 
                                    OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM mytable

So, is there a way to used nested window functions in Snowflake (with case when ...) and if so, how/what am I doing wrong ?

Upvotes: 3

Views: 3012

Answers (2)

Lukasz Szozda
Lukasz Szozda

Reputation: 175726

"I've seen a lot of questions about this general error, but I don't get why I have it, maybe because of nested window functions..."

Snowflake supports reusing expressions at the same level(sometimes called "lateral column alias reference" )

It is perfectly fine to write:

SELECT 1+1 AS col1,
       col1 *2 AS col2,
       CASE WHEN col1 > col2 THEN 'Y' ELSE 'NO' AS col3
       ...

In standard SQL you will either have to use multiple levels of query(cte) or use LATERAL JOIN. Related: PostgreSQL: using a calculated column in the same query


Unfortunately the same syntax will not work for analytic functions(and I am now aware of any RDMBS that supports it):

SELECT ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) AS rn
      ,MAX(rn) OVER(PARTITION BY <different than prev) AS m
FROM tab;

In the SQL Standard 2016 there is optional feature: T619 Nested window functions.

Here an article how the nested analytic function query could look like: Nested window functions in SQL.

It means that current way to nest windowed function is usage of derived table/cte:

WITH cte AS (
    SELECT ROW_NUMBER() OVER(PARTITION BY ... ORDER BY ...) AS rn
           ,*
    FROM tab
)
SELECT  *, MAX(rn) OVER(PARTITION BY <different than prev) AS m
FROM cte

Upvotes: 1

Simeon Pilgrim
Simeon Pilgrim

Reputation: 25968

So deconstructing your logic to show it's the second FIRST_VALUE that causes the problem

WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
    SELECT * FROM VALUES
        (1,1,1,1,1,'testvalue',10),
        (1,1,2,3,2,'value',11)
)
SELECT
    Col_A,
    Col_B,
    FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
    MAX(Col_D)                      OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp ASC
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                    ORDER BY Col_TimeStamp DESC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,   
    IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
    /*,FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as the_problem*/
FROM data
ORDER BY Col_A,Col_B, col_timestamp
;

if we uncomment the_problem we have it.. compare to PostgreSQL (my background) just getting to reuse so many prior results/steps is a gift, so here I just bust out another SELECT layer.

WITH data(Col_A,Col_B,Col_c,col_d, Col_TimeStamp, col_t,col_e) AS (
    SELECT * FROM VALUES
        (1,1,1,1,1,'testvalue',10),
        (1,1,2,3,2,'value',11)
)
SELECT *,
    FIRST_VALUE(if_test_last_e) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                    ORDER BY Col_TimeStamp ASC 
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as not_a_problem
FROM (
    SELECT
        Col_A,
        Col_B,
        FIRST_VALUE(Col_C) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B 
                                        ORDER BY Col_TimeStamp ASC 
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_c,
        MAX(Col_D)                      OVER (PARTITION BY Col_A, Col_B
                                        ORDER BY Col_TimeStamp ASC
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
        LAST_VALUE(Col_E) IGNORE NULLS OVER (PARTITION BY Col_A, Col_B
                                        ORDER BY Col_TimeStamp DESC 
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last_e,   
        IFF(Col_T = 'testvalue', last_e, NULL) as if_test_last_e
        ,Col_TimeStamp
    FROM data
)
ORDER BY Col_A,Col_B, Col_TimeStamp

And then it all works. This also happens if you LAG then IFF/FIRST_VALUE and then LAG that second result.

Upvotes: 2

Related Questions