MatBailie
MatBailie

Reputation: 86775

(REDSHIFT) Vertical Coalesce / FIRST_VALUE() as an Aggregate

(This is Redshift specific and should account for it's columnar nature, sort order, etc.)

I need to get the first non-NULL value from each column, by category, when sorted by a timestamp.

Essentially, the same as FIRST_VALUE() does, but as an aggregate.

Alternatively, a COALESCE() as an aggregate.

Redshift, however, doesn't have the niceties of later version of PostgreSQL or Oracle. So, I'm seeking options to test on my 100 million row imports :)

(I don't like either of my options, but I'm stumped for better ones.)


Sample Input

 category | row_timestamp | value_a | value_b | value_c
----------+---------------+---------+---------+---------

    01    |      001      |   NULL  |   NULL  |     4
    01    |      010      |      7  |   NULL  |  NULL
    01    |      100      |   NULL  |      1  |     2
    01    |      999      |      6  |      3  |     6

    02    |      001      |      1  |   NULL  |  NULL
    02    |      010      |   NULL  |      2  |  NULL
    02    |      100      |   NULL  |      1  |     9
    02    |      999      |      6  |      3  |     2

Expected Results

 category |                 value_a | value_b | value_c
----------+-------------------------+---------+---------
    01    |                      7  |      1  |     4
    02    |                      1  |      2  |     9


Current solution

SELECT DISTINCT
    category,
    FIRST_VALUE(value_a IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
                 AS value_a,

    FIRST_VALUE(value_b IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
                 AS value_b,

    FIRST_VALUE(value_c IGNORE NULLS)
        OVER (PARTITION BY category
                  ORDER BY row_timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
             )
                 AS value_c
FROM
    mytable

It works, but the DISTINCT might apply over hundreds or thousands of rows. Less than ideal.

If it was only for one or two columns, this might work (but it's for a dozen columns, so it's horrible)...

WITH
    sorted_value_a AS
(
    SELECT
        category,
        value_a,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_a IS NOT NULL, row_timestamp
                          )
                              AS row_ordinal
    FROM
        myTable
),
    sorted_value_b AS
(
    SELECT
        category,
        value_b,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_b IS NOT NULL, row_timestamp
                          )
                              AS row_ordinal
    FROM
        myTable
),
    sorted_value_c AS
(
    SELECT
        category,
        value_c,
        ROW_NUMBER() OVER (PARTITION BY category
                               ORDER BY value_c IS NOT NULL, row_timestamp
                          )
                              AS row_ordinal
    FROM
        myTable
)
SELECT
    *
FROM
    sorted_value_a   AS a
INNER JOIN
    sorted_value_b   AS b
        ON b.category = a.category
INNER JOIN
    sorted_value_c   AS c
        ON c.category = a.category

Upvotes: 2

Views: 2160

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1270713

Well, I don't know if this is as aesthetically pleasing, but you could do:

select category, value_a, value_b, value_c, value_d
from (select coalesce(value_a, lag(value_a ignore nulls) over (partition by category order by row_timestamp)) as value_a,
             coalesce(value_b, lag(value_b ignore nulls) over (partition by category order by row_timestamp)) as value_b,
             coalesce(value_c, lag(value_c ignore nulls) over (partition by category order by row_timestamp)) as value_c,
             coalesce(value_d, lag(value_d ignore nulls) over (partition by category order by row_timestamp)) as value_d
             row_number() over (partition by category order by row_timestamp desc) as seqnum 
      from mytable t
     ) t
where seqnum = 1;

Upvotes: 2

Related Questions