Frazer Bayley
Frazer Bayley

Reputation: 21

Merging Rows in Redshift Choosing the Last Value

I have a table that looks something like this:

+-----+-----+--------------------------+--------+------+-----------+
| uid | sid |           url            |  name  | age  | timestamp |
+-----+-----+--------------------------+--------+------+-----------+
|   1 | a   | null                     | Fred   | null |       123 |
|   1 | a   | null                     | null   | 23   |       124 |
|   1 | a   | google.com/index.html    | null   | null |       125 |
|   1 | a   | null                     | Freddie| null |       126 |
|   1 | a   | null                     | null   | 23   |       127 |
|   1 | a   | null                     | null   | 24   |       128 |
|   1 | a   | null                     | Freddy | null |       129 |
|   1 | a   | null                     | null   | 22   |       130 |
|   1 | a   | google.com/features.html | null   | null |       131 |
|   1 | a   | null                     | Freddy | null |       132 |
|   1 | a   | null                     | null   | 24   |       133 |
|   1 | b   | google.com/index.html    | null   | null |       134 |
|   1 | b   | null                     | Freddy | null |       135 |
|   1 | b   | null                     | null   | 24   |       136 |
+-----+-----+--------------------------+--------+------+-----------+

And I want to transform it to:

+-----+-----+--------------------------+--------+-----+-----------+
| uid | sid |           url            |  name  | age | timestamp |
+-----+-----+--------------------------+--------+-----+-----------+
|   1 | a   | null                     | Fred   |  23 |       123 |
|   1 | a   | google.com/index.html    | Freddy |  24 |       125 |
|   1 | a   | google.com/features.html | Freddy |  24 |       131 |
|   1 | b   | google.com/index.html    | Freddy |  24 |       134 |
+-----+-----+--------------------------+--------+-----+-----------+

Essentially I am combining records that share the same uid and sid, but also squashing records that follow a url change. I am wanting to keep the last value of name and the max value of age.

Is this possible in SQL/Redshift? Ive looked into 'GROUP BY' and/or windowing to achieve this but cant seem to get the desired result. If possible can someone please describe the query that makes it so.

Upvotes: 1

Views: 235

Answers (2)

Frazer Bayley
Frazer Bayley

Reputation: 21

Ok after tinkering a little with @GordonLinoff answer I got my desired result. The query looks like this:

SELECT
    DISTINCT group_id, 
    last_value(uid IGNORE NULLS) OVER (
        PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
    ) AS uid, 
    last_value(sid IGNORE NULLS) OVER (
        PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
    ) AS sid, 
    last_value(url IGNORE NULLS) OVER (
        PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
    ) AS url, 
    last_value(name IGNORE NULLS) OVER (
        PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
    ) AS name, 
    max(age IGNORE NULLS) OVER (
        PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
    ) AS age, 
    min(timestamp IGNORE NULLS) OVER (
        PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
    ) AS timestamp
FROM (
    SELECT
        t.*,
        CONCAT(
            COUNT(url) OVER (
                PARTITION BY uid, sid ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ),
            CONCAT(uid, sid)
        ) AS group_id 
    FROM t
) AS t_grouped 
ORDER BY timestamp ASC

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269443

I think you can define the groups with a cumulative count of the values in url. However, then you need a way to get the last non-NULL value in each group, for which you can use last_value() or first_value():

select distinct
       max(url) over (partition by uid, sid, grp),
       first_value(name ignore nulls) over (partition by uid, sid, grp order by timestamp desc rows between unbounded preceding and current row) as name,
       first_value(age ignore nulls) over (partition by uid, sid, grp order by timestamp desc rows between unbounded preceding and current row) as age
from (select t.*,
             count(url) over (partition by uid, sid order by timestamp) as grp
      from t
     ) t

Upvotes: 0

Related Questions