dang
dang

Reputation: 2422

Day difference of counts from prior 24 hours

I have a table to consolidate stats for different types from my other tables:

Table name: my_stats

lob         category            parameter                   total_count     timestamp                       day_difference
DSS         Industry            Advertising & Marketing     310057          2020-04-21 07:35:14.237987
DSS         Function            Administration              357351          2020-04-21 11:06:27.009658
DSS         Country             czechia                     321             2020-04-21 11:12:55.731648
DSS         Records per domain  apple.com                   65              2020-04-21 11:13:17.855059
DSS         Records per domain  Records per domain          5               2020-04-21 11:13:17.85510

DSS         Industry            Advertising & Marketing     310059          2020-04-21 10:36:14.237987
DSS         Function            Administration              357353          2020-04-21 14:08:26.009658
DSS         Country             czechia                     324             2020-04-21 14:11:55.731648
DSS         Records per domain  apple.com                   60              2020-04-21 14:08:17.855059
DSS         Records per domain  Records per domain          5               2020-04-21 14:14:17.85510

DSS         Industry            Advertising & Marketing     310058          2020-04-22 08:35:14.237987
DSS         Function            Administration              357312          2020-04-22 11:05:27.009658
DSS         Country             czechia                     201             2020-04-22 11:13:55.731648
DSS         Records per domain  apple.com                   55              2020-04-22 11:14:17.855059
DSS         Records per domain  Records per domain          2               2020-04-22 11:15:17.85510

my_stats gets updated every 3 hours. So, new entries are added every 3 hours. I have to find the day_difference value.

The day_difference value is (count - count of subsequent row closest 24 hours prior).

The output of the table should be:

lob         category            parameter                   total_count     timestamp                       day_difference
DSS         Industry            Advertising & Marketing     310057          2020-04-21 07:35:14.237987      NULL
DSS         Function            Administration              357351          2020-04-21 11:06:27.009658      NULL
DSS         Country             czechia                     321             2020-04-21 11:12:55.731648      NULL
DSS         Records per domain  apple.com                   65              2020-04-21 11:13:17.855059      NULL
DSS         Records per domain  Records per domain          5               2020-04-21 11:13:17.85510       NULL

DSS         Industry            Advertising & Marketing     310059          2020-04-21 10:36:14.237987      NULL
DSS         Function            Administration              357353          2020-04-21 14:08:26.009658      NULL
DSS         Country             czechia                     324             2020-04-21 14:11:55.731648      NULL
DSS         Records per domain  apple.com                   60              2020-04-21 14:08:17.855059      NULL
DSS         Records per domain  Records per domain          5               2020-04-21 14:14:17.85510       NULL

DSS         Industry            Advertising & Marketing     310058          2020-04-22 08:35:14.237987      1
DSS         Function            Administration              357312          2020-04-22 11:05:27.009658      NULL
DSS         Country             czechia                     201             2020-04-22 11:13:55.731648      -120
DSS         Records per domain  apple.com                   55              2020-04-22 11:14:17.855059      -10
DSS         Records per domain  Records per domain          2               2020-04-22 11:15:17.85510       -3

If for the row, subsequent row prior to 24 hours is not available, then keep the day_difference = NULL.

Another corner case to be considered is the difference should be with the CLOSEST prior 24 hours difference.

Is there a way I can get this result in SQL?

Upvotes: 2

Views: 144

Answers (2)

Dave Skender
Dave Skender

Reputation: 621

You can use RANK to identify the first record in the 24-hour period, then just a simple subtraction. I'm assuming you're using Amazon RedShift SQL here. If you are not, the NVL syntax is replace with ISNULL in MS SQL.

https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html

SELECT
  r.lob
 ,r.category
 ,r.parameter
 ,r.total_count
 ,r.timestamp
 ,i.day_difference
FROM dbo.my_stats r
LEFT JOIN (

  SELECT
    x.lob
   ,x.category
   ,x.parameter
   ,x.total_count
   ,x.timestamp
   ,RANK() OVER (PARTITION BY p.lob,p.category,p.parameter ORDER BY p.timestamp) AS rank_order
   ,x.total_count - p.total_count AS day_difference
  FROM dbo.my_stats x
  INNER JOIN dbo.my_stats p  --> lookback period
    ON p.lob       = x.lob
   AND p.category  = x.category
   AND p.parameter = x.parameter
   AND p.timestamp < x.timestamp
   AND p.timestamp > DATEADD(HOUR,-24,x.timestamp)

) i
 ON i.lob       = r.lob
AND i.category  = r.category
AND i.parameter = r.parameter
AND i.timestamp = r.timestamp
WHERE NVL(i.rank_order,1)=1

Upvotes: 1

GMB
GMB

Reputation: 222702

This would have been a good spot to use a window function such as lag() with a range specification. Alas, Redshift only supports rows in the frame clause to window functions.

Here is an alternative that uses a correlated subquery:

select
    s.*,
    total_count - (
        select total_count
        from my_stats s1
        where 
            s1.lob = s.lob
            and s1.category = s.category
            and s1.parameter = s.parameter
            and s1.timestamp < s.timestamp - interval '1 day'
        order by s1.timestamp desc
        limit 1
    ) day_diff
from my_stats s

For performance, you do want an index on (lob, category, parameter, timestamp, total_count).

Upvotes: 3

Related Questions