Reputation: 2422
I have a table to consolidate stats for different types from my other tables:
Table name: my_stats
lob category parameter total_count timestamp day_difference
DSS Industry Advertising & Marketing 310057 2020-04-21 07:35:14.237987
DSS Function Administration 357351 2020-04-21 11:06:27.009658
DSS Country czechia 321 2020-04-21 11:12:55.731648
DSS Records per domain apple.com 65 2020-04-21 11:13:17.855059
DSS Records per domain Records per domain 5 2020-04-21 11:13:17.85510
DSS Industry Advertising & Marketing 310059 2020-04-21 10:36:14.237987
DSS Function Administration 357353 2020-04-21 14:08:26.009658
DSS Country czechia 324 2020-04-21 14:11:55.731648
DSS Records per domain apple.com 60 2020-04-21 14:08:17.855059
DSS Records per domain Records per domain 5 2020-04-21 14:14:17.85510
DSS Industry Advertising & Marketing 310058 2020-04-22 08:35:14.237987
DSS Function Administration 357312 2020-04-22 11:05:27.009658
DSS Country czechia 201 2020-04-22 11:13:55.731648
DSS Records per domain apple.com 55 2020-04-22 11:14:17.855059
DSS Records per domain Records per domain 2 2020-04-22 11:15:17.85510
my_stats
gets updated every 3 hours. So, new entries are added every 3 hours. I have to find the day_difference
value.
The day_difference
value is (count - count of subsequent row closest 24 hours prior).
The output of the table should be:
lob category parameter total_count timestamp day_difference
DSS Industry Advertising & Marketing 310057 2020-04-21 07:35:14.237987 NULL
DSS Function Administration 357351 2020-04-21 11:06:27.009658 NULL
DSS Country czechia 321 2020-04-21 11:12:55.731648 NULL
DSS Records per domain apple.com 65 2020-04-21 11:13:17.855059 NULL
DSS Records per domain Records per domain 5 2020-04-21 11:13:17.85510 NULL
DSS Industry Advertising & Marketing 310059 2020-04-21 10:36:14.237987 NULL
DSS Function Administration 357353 2020-04-21 14:08:26.009658 NULL
DSS Country czechia 324 2020-04-21 14:11:55.731648 NULL
DSS Records per domain apple.com 60 2020-04-21 14:08:17.855059 NULL
DSS Records per domain Records per domain 5 2020-04-21 14:14:17.85510 NULL
DSS Industry Advertising & Marketing 310058 2020-04-22 08:35:14.237987 1
DSS Function Administration 357312 2020-04-22 11:05:27.009658 NULL
DSS Country czechia 201 2020-04-22 11:13:55.731648 -120
DSS Records per domain apple.com 55 2020-04-22 11:14:17.855059 -10
DSS Records per domain Records per domain 2 2020-04-22 11:15:17.85510 -3
If for the row, subsequent row prior to 24 hours is not available, then keep the day_difference = NULL.
Another corner case to be considered is the difference should be with the CLOSEST prior 24 hours difference.
Is there a way I can get this result in SQL?
Upvotes: 2
Views: 144
Reputation: 621
You can use RANK to identify the first record in the 24-hour period, then just a simple subtraction. I'm assuming you're using Amazon RedShift SQL here. If you are not, the NVL
syntax is replace with ISNULL
in MS SQL.
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html
SELECT
r.lob
,r.category
,r.parameter
,r.total_count
,r.timestamp
,i.day_difference
FROM dbo.my_stats r
LEFT JOIN (
SELECT
x.lob
,x.category
,x.parameter
,x.total_count
,x.timestamp
,RANK() OVER (PARTITION BY p.lob,p.category,p.parameter ORDER BY p.timestamp) AS rank_order
,x.total_count - p.total_count AS day_difference
FROM dbo.my_stats x
INNER JOIN dbo.my_stats p --> lookback period
ON p.lob = x.lob
AND p.category = x.category
AND p.parameter = x.parameter
AND p.timestamp < x.timestamp
AND p.timestamp > DATEADD(HOUR,-24,x.timestamp)
) i
ON i.lob = r.lob
AND i.category = r.category
AND i.parameter = r.parameter
AND i.timestamp = r.timestamp
WHERE NVL(i.rank_order,1)=1
Upvotes: 1
Reputation: 222702
This would have been a good spot to use a window function such as lag()
with a range specification. Alas, Redshift only supports rows
in the frame clause to window functions.
Here is an alternative that uses a correlated subquery:
select
s.*,
total_count - (
select total_count
from my_stats s1
where
s1.lob = s.lob
and s1.category = s.category
and s1.parameter = s.parameter
and s1.timestamp < s.timestamp - interval '1 day'
order by s1.timestamp desc
limit 1
) day_diff
from my_stats s
For performance, you do want an index on (lob, category, parameter, timestamp, total_count)
.
Upvotes: 3