Reputation: 1274
I have a bunch of orders. Each order is either a type A or type B order. I want a 3-month moving average of time it takes to ship orders of each type. How can I aggregate this order data into what I want using Redshift or Postgres SQL?
Start with this:
order_id | order_type | ship_date | time_to_ship |
---|---|---|---|
1 | a | 2021-12-25 | 100 |
2 | b | 2021-12-31 | 110 |
3 | a | 2022-01-01 | 200 |
4 | a | 2022-01-01 | 50 |
5 | b | 2022-01-15 | 110 |
6 | a | 2022-02-02 | 100 |
7 | a | 2022-02-28 | 300 |
8 | b | 2022-04-05 | 75 |
9 | b | 2022-04-06 | 210 |
10 | a | 2022-04-15 | 150 |
Note: Some months have no shipments. The solution should allow for this.
I want this:
order_type | ship__month | mma3_time_to_ship |
---|---|---|
a | 2022-02-01 | 150 |
a | 2022-04-01 | 160 |
b | 2022-04-01 | 126.25 |
Where a 3-month moving average is only calculated for months with at least 2 preceding months. Each record is an order type-month. The ship_month
columns denotes the month of shipment (Redshift represents months as the date of the first of the month).
Here's how the mma3_time_to_ship
column is calculated, expressed as Excel-like formulas:
150 = AVERAGE(100, 200, 50, 100, 300)
<- The average for all A orders in Dec, Jan, and Feb.
160 = AVERAGE(200, 50, 100, 300, 150)
<- The average for all A orders in Jan, Feb, Apr (no orders in March)
126.25 = AVERAGE(110, 110, 75, 210)
<- The average for all B orders in Dec, Jan, Apr (no B orders in Feb, no orders at all in Mar)
My attempt doesn't aggregate it into monthly data and 3-month averages (this query runs without error in Redshift):
SELECT
order_type,
DATE_TRUNC('month', ship_date) AS ship_month,
AVG(time_to_ship) OVER (
PARTITION BY
order_type,
ship_month
ORDER BY ship_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS avg_time_to_ship
FROM tbl
Is what I want possible?
Upvotes: 1
Views: 717
Reputation: 16397
This is honestly a complete stab in the dark, so it won't surprise me if it's not correct... but it seems to me you can accomplish this with a self join using a range of dates within the join.
select
t1.order_type, t1.ship_date, avg (t2.time_to_ship) as 3mma_time_to_ship
from
tbl t1
join tbl t2 on
t1.order_type = t2.order_type and
t2.ship_date between t1.ship_date - interval '3 months' and t1.ship_date
group by
t1.order_type, t1.ship_date
The results don't match your example, but then I'm not entirely sure where they came from anyway.
Perhaps this will be the catalyst towards an eventual solution or at least an idea to start.
This is Pg12, by the way. Not sure if it will work on Redshift.
-- EDIT --
Per your updates, I was able to match your three results identically. I used dense_rank to find the closest three months:
with foo as (
select
order_type, date_trunc ('month', ship_date)::date as ship_month,
time_to_ship, dense_rank() over (partition by order_type order by date_trunc ('month', ship_date)) as dr
from tbl
)
select
f1.order_type, f1.ship_month,
avg (f2.time_to_ship),
array_agg (f2.time_to_ship)
from
foo f1
join foo f2 on
f1.order_type = f2.order_type and
f2.dr between f1.dr - 2 and f1.dr
group by
f1.order_type, f1.ship_month
Results:
b 2022-01-01 110.0000000000000000 {110,110}
a 2022-01-01 116.6666666666666667 {100,50,200,100,50,200}
b 2022-04-01 126.2500000000000000 {110,110,75,210,110,110,75,210}
b 2021-12-01 110.0000000000000000 {110}
a 2021-12-01 100.0000000000000000 {100}
a 2022-02-01 150.0000000000000000 {100,50,200,100,300,100,50,200,100,300}
a 2022-04-01 160.0000000000000000 {50,200,100,300,150}
There are some dupes in the array elements, but it doesn't seem to impact the averages. I'm sure that part could be fixed.
Upvotes: 1