Reputation: 11
COVID-19 Data obtained from OurWorldInData includes total_boosters column to track vaccine boosters in any given country. My goal is to create a column for daily new boosters (boosters_new), then create a 90 day rolling sum (boosters_90roll). The problem is total_boosters has a lot of missing data, so I would like to linearly interpolate the missing data. If the missing data is not filled in any way, then a rolling sum would be a poor estimate and would be visualized poorly.
I already have a working solution to this, but it is the trickiest query I've worked on. I am posting this question to learn if there are cleaner/more efficient ways of doing this.
Sample input:
location | date | total_boosters |
---|---|---|
Albania | 2022-03-07 | 245527 |
Albania | 2022-03-08 | |
Albania | 2022-03-09 | |
Albania | 2022-03-10 | 248491 |
Albania | 2022-03-11 | |
Albania | 2022-03-12 | |
Albania | 2022-03-13 | |
Albania | 2022-03-14 | 251024 |
Albania | 2022-03-15 | 252161 |
Sample output:
location | date | boosters_new | boosters_90roll |
---|---|---|---|
Albania | 2022-03-07 | 988 | 988 |
Albania | 2022-03-08 | 988 | 1,976 |
Albania | 2022-03-09 | 988 | 2,964 |
Albania | 2022-03-10 | 633.25 | 3,597.25 |
Albania | 2022-03-11 | 633.25 | 4,230.5 |
Albania | 2022-03-12 | 633.25 | 4,863.75 |
Albania | 2022-03-13 | 633.25 | 5,497 |
Albania | 2022-03-14 | 1,137 | 6,634 |
Albania | 2022-03-15 | 886.8889 | 7,520.8889 |
Current solution
-- Group total_boosters and sequential nulls as boosters_group, then counts and numbers the rows in each group
WITH BoostersGrouped AS
(
SELECT
location,
date,
total_boosters,
boosters_group,
COUNT(boosters_group) OVER (
PARTITION BY location, boosters_group
) AS group_count,
ROW_NUMBER() OVER (
PARTITION BY location, boosters_group
) AS group_nrow
FROM
(
SELECT
location,
date,
total_boosters,
COUNT(total_boosters) OVER (
PARTITION BY location ORDER BY date
) AS boosters_group
FROM Vaccinations
) grouped
),
-- Self join to get total_boosters for current and next booster_group
BoostersFilled AS
(
SELECT
bg1.location,
bg1.date,
bg1.boosters_group,
bg1.group_count,
bg1.group_nrow,
bg2.boosters_current,
bg3.boosters_next
FROM BoostersGrouped bg1
LEFT JOIN
(
SELECT
location,
boosters_group,
max(total_boosters) AS boosters_current
FROM BoostersGrouped
GROUP BY location, boosters_group
) bg2
ON bg1.location = bg2.location
AND bg1.boosters_group = bg2.boosters_group
LEFT JOIN
(
SELECT
location,
boosters_group,
max(total_boosters) AS boosters_next
FROM BoostersGrouped
GROUP BY location, boosters_group
) bg3
ON bg1.location = bg3.location
AND (bg1.boosters_group + 1) = bg3.boosters_group
),
-- Linearly interpolate total_boosters
BoostersInterp AS
(
SELECT
location,
date,
CASE
WHEN boosters_group = 0 THEN 0
ELSE (boosters_current + ((boosters_next - boosters_current)* group_nrow / group_count))
END AS boosters_interp
FROM BoostersFilled
),
-- Calculate daily boosters_new
BoostersNew AS
(
SELECT
location,
date,
(boosters_interp - boosters_lag) AS boosters_new
FROM
(
SELECT
location,
date,
boosters_interp,
LAG(boosters_interp) OVER (
PARTITION BY location ORDER BY date
) AS boosters_lag
FROM BoostersInterp
) BoostersLag
)
-- Calculate 90 day rolling sum
SELECT
location,
date,
boosters_new,
SUM(boosters_new) OVER (
PARTITION BY location ORDER BY date ROWS BETWEEN 89 PRECEDING AND CURRENT ROW
) AS boosters_90roll
FROM BoostersNew;
Upvotes: 1
Views: 177
Reputation: 222632
That's an interesting question!
I suspect that logic can be simplified (and possibly become more efficient?) by making more use of window functions.
Here is a query that does the linear regression directly on the total_boosters
metric.
select location, date,
coalesce(
total_boosters,
total_boosters1 + (total_boosters2 - total_boosters1) * (rn - 1) / cnt
) total_boosters
from (
select v.*,
min(total_boosters) over(partition by location, grp1) total_boosters1,
min(total_boosters) over(partition by location, grp2) total_boosters2,
count(*) over(partition by location, grp1) cnt,
row_number() over(partition by location, grp1 order by date) rn
from (
select v.*,
count(total_boosters) over(partition by location order by date) grp1,
count(total_boosters) over(partition by location order by date desc) grp2
from vaccinations v
) v
) v
The logic is to define groups of rows that either start (grp1
) or end (grp2
) with a non-null value, followed by null values ; using these groups, we can identify the "preceding" and "following" values, count the number of missing values and their positions : that's the second layer of the query. The most outer query does the linear regression.
Now it is quite easy to compute the new_boosters
metrics, using lag()
. Yur code already has a similar logic ; We can wrap the above query in a CTE, and then:
with cte as ( ... above query ...)
select location, date, total_boosters,
total_boosters - lag_total_boosters as boosters_new,
avg(total_boosters - lag_total_boosters) over(
partition by location
order by date rows between 89 preceding and current row
) boosters_90roll
from (
select c.*,
lag(total_boosters) over(partition by location order by date) lag_total_boosters
from cte c
) c
location | date | total_boosters | boosters_new | boosters_90roll |
---|---|---|---|---|
Albania | 2022-03-07 | 245527.0000 | null | null |
Albania | 2022-03-08 | 246515.0000 | 988.0000 | 988.0000 |
Albania | 2022-03-09 | 247503.0000 | 988.0000 | 1976.0000 |
Albania | 2022-03-10 | 248491.0000 | 988.0000 | 2964.0000 |
Albania | 2022-03-11 | 249124.2500 | 633.2500 | 3597.2500 |
Albania | 2022-03-12 | 249757.5000 | 633.2500 | 4230.5000 |
Albania | 2022-03-13 | 250390.7500 | 633.2500 | 4863.7500 |
Albania | 2022-03-14 | 251024.0000 | 633.2500 | 5497.0000 |
Albania | 2022-03-15 | 252161.0000 | 1137.0000 | 6634.0000 |
Upvotes: 0