RyanP

Reputation: 11

Linearly Interpolate Missing Data in MySQL

COVID-19 data obtained from OurWorldInData includes a total_boosters column that tracks vaccine boosters for each country. My goal is to derive a column of daily new boosters (boosters_new), then a 90-day rolling sum (boosters_90roll). The problem is that total_boosters has a lot of missing data, so I would like to linearly interpolate the missing values. If the gaps are not filled in some way, the rolling sum is a poor estimate and visualizes badly.

I already have a working solution, but it is the trickiest query I've written. I am posting this question to learn whether there are cleaner or more efficient ways of doing this.

Sample input:

location  date        total_boosters
Albania   2022-03-07  245527
Albania   2022-03-08  null
Albania   2022-03-09  null
Albania   2022-03-10  248491
Albania   2022-03-11  null
Albania   2022-03-12  null
Albania   2022-03-13  null
Albania   2022-03-14  251024
Albania   2022-03-15  252161
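
For reference, the queries below assume a table shaped roughly like this (a minimal sketch; the column types are my guess from the sample, and the real OurWorldInData extract has many more columns):

-- Hypothetical DDL for the Vaccinations table used below
CREATE TABLE Vaccinations (
    location       VARCHAR(100) NOT NULL,
    date           DATE NOT NULL,
    total_boosters DECIMAL(20, 4) NULL,  -- nullable: this is the column with gaps
    PRIMARY KEY (location, date)
);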

Sample output:

location  date        boosters_new  boosters_90roll
Albania   2022-03-07  988           988
Albania   2022-03-08  988           1,976
Albania   2022-03-09  988           2,964
Albania   2022-03-10  633.25        3,597.25
Albania   2022-03-11  633.25        4,230.5
Albania   2022-03-12  633.25        4,863.75
Albania   2022-03-13  633.25        5,497
Albania   2022-03-14  1,137         6,634
Albania   2022-03-15  886.8889      7,520.8889
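
To show where these numbers come from: the gap between the known values 245527 (2022-03-07) and 248491 (2022-03-10) is 2964 spread over 3 days, giving 2964 / 3 = 988 per day, and the next gap, 251024 - 248491 = 2533 over 4 days, gives 633.25 per day. The 1,137 on 2022-03-14 is simply 252161 - 251024, while the 886.8889 on 2022-03-15 depends on the next known value after the rows shown here.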

Current solution

-- Group each total_boosters value and the sequential NULLs after it as boosters_group, then count and number the rows in each group
WITH BoostersGrouped AS
(
SELECT
    location,
    date,
    total_boosters,
    boosters_group,
    COUNT(boosters_group) OVER (
        PARTITION BY location, boosters_group
    ) AS group_count,
    ROW_NUMBER() OVER (
        PARTITION BY location, boosters_group ORDER BY date
    ) AS group_nrow
FROM
    (
    SELECT
        location,
        date,
        total_boosters,
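        -- COUNT() skips NULLs, so this running count stays flat across
        -- NULL rows: each non-NULL value and the NULLs that follow it
        -- end up sharing the same boosters_group value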
        COUNT(total_boosters) OVER (
            PARTITION BY location ORDER BY date
        ) AS boosters_group
    FROM Vaccinations
    ) grouped
),
-- Self join to get total_boosters for the current and next boosters_group
BoostersFilled AS
(
SELECT
    bg1.location,
    bg1.date,
    bg1.boosters_group,
    bg1.group_count,
    bg1.group_nrow,
    bg2.boosters_current,
    bg3.boosters_next
FROM BoostersGrouped bg1
LEFT JOIN
    (
    SELECT
        location,
        boosters_group,
        max(total_boosters) AS boosters_current
    FROM BoostersGrouped
    GROUP BY location, boosters_group
    ) bg2
    ON bg1.location = bg2.location
    AND bg1.boosters_group = bg2.boosters_group
LEFT JOIN 
    (
    SELECT
        location,
        boosters_group,
        max(total_boosters) AS boosters_next
    FROM BoostersGrouped
    GROUP BY location, boosters_group
    ) bg3
    ON bg1.location = bg3.location
    AND (bg1.boosters_group + 1) = bg3.boosters_group
),
-- Linearly interpolate total_boosters
BoostersInterp AS
(
SELECT
    location,
    date,
    CASE
        WHEN boosters_group = 0 THEN 0
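        -- boosters_group = 0 covers leading NULLs before the first known value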
        ELSE boosters_current + (boosters_next - boosters_current) * group_nrow / group_count
    END AS boosters_interp
FROM BoostersFilled
),
-- Calculate daily boosters_new
BoostersNew AS
(
SELECT
    location,
    date,
    (boosters_interp - boosters_lag) AS boosters_new
FROM
    (
    SELECT
        location,
        date,
        boosters_interp,
        LAG(boosters_interp) OVER (
            PARTITION BY location ORDER BY date
        ) AS boosters_lag
    FROM BoostersInterp
    ) BoostersLag
)
-- Calculate 90 day rolling sum
SELECT
    location,
    date,
    boosters_new,
    SUM(boosters_new) OVER (
        PARTITION BY location ORDER BY date ROWS BETWEEN 89 PRECEDING AND CURRENT ROW
    ) AS boosters_90roll
FROM BoostersNew;

Upvotes: 1

Views: 177

Answers (1)

GMB

Reputation: 222632

That's an interesting question!

I suspect that the logic can be simplified (and possibly made more efficient) by making more use of window functions.

Here is a query that does the linear interpolation directly on the total_boosters metric.

select location, date, 
    coalesce(
        total_boosters,
        total_boosters1 + (total_boosters2 - total_boosters1) * (rn - 1) / cnt
    ) total_boosters
from (
    select v.*, 
        min(total_boosters) over(partition by location, grp1) total_boosters1,
        min(total_boosters) over(partition by location, grp2) total_boosters2,
        count(*) over(partition by location, grp1) cnt,
        row_number() over(partition by location, grp1 order by date) rn 
    from (
        select v.*, 
            count(total_boosters) over(partition by location order by date) grp1,
            count(total_boosters) over(partition by location order by date desc) grp2
        from vaccinations v
    ) v
) v

The logic is to define groups of rows that either start (grp1) or end (grp2) with a non-null value, followed by null values; using these groups, we can identify the "preceding" and "following" values, and count the missing values and their positions: that is the second layer of the query. The outermost query does the linear interpolation.
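
To make that concrete, here is what grp1 and grp2 evaluate to on the question's sample rows (hand-traced, assuming those nine rows are the whole table for this location):

date        total_boosters  grp1  grp2
2022-03-07  245527          1     4
2022-03-08  null            1     3
2022-03-09  null            1     3
2022-03-10  248491          2     3
2022-03-11  null            2     2
2022-03-12  null            2     2
2022-03-13  null            2     2
2022-03-14  251024          3     2
2022-03-15  252161          4     1

A null row's preceding value is then min(total_boosters) over its grp1 partition, and its following value is min(total_boosters) over its grp2 partition. For example, 2022-03-08 sits in grp1 = 1 (preceding value 245527) and grp2 = 3 (following value 248491), and with rn = 2 and cnt = 3 it interpolates to 245527 + 2964 * 1 / 3 = 246515.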

Now it is quite easy to compute the boosters_new metric using lag(). Your code already has similar logic; we can wrap the above query in a CTE, and then:

with cte as ( ... above query ...)
select location, date, total_boosters,
    total_boosters - lag_total_boosters as boosters_new,
    sum(total_boosters - lag_total_boosters) over(
        partition by location
        order by date rows between 89 preceding and current row
    ) boosters_90roll
from (
    select c.*, 
        lag(total_boosters) over(partition by location order by date) lag_total_boosters
    from cte c
) c
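
The derived table around lag() is needed because window functions cannot be nested: the rolling sum cannot wrap lag() directly in the same select list, so the lagged value is materialized one level down first.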

Demo on DB Fiddle:

location  date        total_boosters  boosters_new  boosters_90roll
Albania   2022-03-07  245527.0000     null          null
Albania   2022-03-08  246515.0000     988.0000      988.0000
Albania   2022-03-09  247503.0000     988.0000      1976.0000
Albania   2022-03-10  248491.0000     988.0000      2964.0000
Albania   2022-03-11  249124.2500     633.2500      3597.2500
Albania   2022-03-12  249757.5000     633.2500      4230.5000
Albania   2022-03-13  250390.7500     633.2500      4863.7500
Albania   2022-03-14  251024.0000     633.2500      5497.0000
Albania   2022-03-15  252161.0000     1137.0000     6634.0000
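
(Note that the figures land one day later than in the question's sample output: the first row's boosters_new is null because lag() has no preceding row, and each interpolated gap is attributed to the days after a known value rather than starting at it.)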

Upvotes: 0
