Reputation: 21
I have a table that looks something like this:
+-----+-----+--------------------------+--------+------+-----------+
| uid | sid | url | name | age | timestamp |
+-----+-----+--------------------------+--------+------+-----------+
| 1 | a | null | Fred | null | 123 |
| 1 | a | null | null | 23 | 124 |
| 1 | a | google.com/index.html | null | null | 125 |
| 1 | a | null | Freddie| null | 126 |
| 1 | a | null | null | 23 | 127 |
| 1 | a | null | null | 24 | 128 |
| 1 | a | null | Freddy | null | 129 |
| 1 | a | null | null | 22 | 130 |
| 1 | a | google.com/features.html | null | null | 131 |
| 1 | a | null | Freddy | null | 132 |
| 1 | a | null | null | 24 | 133 |
| 1 | b | google.com/index.html | null | null | 134 |
| 1 | b | null | Freddy | null | 135 |
| 1 | b | null | null | 24 | 136 |
+-----+-----+--------------------------+--------+------+-----------+
And I want to transform it to:
+-----+-----+--------------------------+--------+-----+-----------+
| uid | sid | url | name | age | timestamp |
+-----+-----+--------------------------+--------+-----+-----------+
| 1 | a | null | Fred | 23 | 123 |
| 1 | a | google.com/index.html | Freddy | 24 | 125 |
| 1 | a | google.com/features.html | Freddy | 24 | 131 |
| 1 | b | google.com/index.html | Freddy | 24 | 134 |
+-----+-----+--------------------------+--------+-----+-----------+
Essentially I am combining records that share the same uid and sid, but also squashing records that follow a url change. I am wanting to keep the last value of name and the max value of age.
Is this possible in SQL/Redshift? Ive looked into 'GROUP BY' and/or windowing to achieve this but cant seem to get the desired result. If possible can someone please describe the query that makes it so.
Upvotes: 1
Views: 235
Reputation: 21
Ok after tinkering a little with @GordonLinoff answer I got my desired result. The query looks like this:
SELECT
DISTINCT group_id,
last_value(uid IGNORE NULLS) OVER (
PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS uid,
last_value(sid IGNORE NULLS) OVER (
PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS sid,
last_value(url IGNORE NULLS) OVER (
PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS url,
last_value(name IGNORE NULLS) OVER (
PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS name,
max(age IGNORE NULLS) OVER (
PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS age,
min(timestamp IGNORE NULLS) OVER (
PARTITION BY group_id ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS timestamp
FROM (
SELECT
t.*,
CONCAT(
COUNT(url) OVER (
PARTITION BY uid, sid ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
),
CONCAT(uid, sid)
) AS group_id
FROM t
) AS t_grouped
ORDER BY timestamp ASC
Upvotes: 1
Reputation: 1269443
I think you can define the groups with a cumulative count of the values in url
. However, then you need a way to get the last non-NULL
value in each group, for which you can use last_value()
or first_value()
:
select distinct
max(url) over (partition by uid, sid, grp),
first_value(name ignore nulls) over (partition by uid, sid, grp order by timestamp desc rows between unbounded preceding and current row) as name,
first_value(age ignore nulls) over (partition by uid, sid, grp order by timestamp desc rows between unbounded preceding and current row) as age
from (select t.*,
count(url) over (partition by uid, sid order by timestamp) as grp
from t
) t
Upvotes: 0