Jonas Hörsch
Jonas Hörsch

Reputation: 535

Fill null values with linear interpolation using DuckDB

I am trying to fill null values in a numeric table using duckdb.

Let's consider the simple table df:

x y
0 2.0
1 NULL
3 NULL
5 7.0
6 4.0
8 5.0
9 NULL

that you can create with

CREATE TABLE df (x NUMERIC, y NUMERIC)
INSERT INTO df VALUES (0, 2), (1, NULL), (3, NULL), (5, 7), (6, 4), (8, 5), (9, NULL)

then i would like to create a SELECT statement to retrieve:

x y
0 2.0
1 3.0
3 5.0
5 7.0
6 4.0
8 5.0
9 NULL

I think that a correlated subquery which finds the following and previous non-null values and then computes the interpolation should work:

SELECT
    x,
    COALESCE(
        y,
        (
            SELECT 
                prev.y + ( (next.y - prev.y) * (parent.x - prev.x) / (next.x - prev.x) )
            FROM
                ( SELECT x, y FROM df WHERE x <= parent.x and y is not null ORDER BY x DESC LIMIT 1 ) AS prev
            CROSS JOIN
                ( SELECT x, y FROM df WHERE x >= parent.x and y is not null ORDER BY x ASC LIMIT 1 ) AS next
        )
    ) AS y
FROM
    df parent

But duckdb is complaining with

INTERNAL Error: Failed to bind column reference "limit_rownum" [30.0] (bindings: [8.0 8.1 8.2 28.0])

EDIT: Adjusted the example to ensure that x and y columns are both taken into account for interpolating the gaps.

Upvotes: 1

Views: 947

Answers (1)

Paul Maxwell
Paul Maxwell

Reputation: 35583

I believe you can bypass the error message by adopting row_number() as proposed in this duckdb bug reference. nb: The fiddle below uses postgres as I'm unable to test in duckdb:

CREATE TABLE df (x NUMERIC(3,1), y NUMERIC(3,1));

INSERT INTO df VALUES (0, 2), (1, NULL), (3, NULL), (5, 7), (6, 4), (8, 5), (9, NULL)

SELECT
    x,
    COALESCE(
        y,
        (
            SELECT 
                prev.y + ( (next.y - prev.y) * (parent.x - prev.x) / (next.x - prev.x) )
            FROM
                ( SELECT x, y, ROW_NUMBER() OVER (ORDER BY x DESC) as rn FROM df WHERE x <= parent.x and y is not null ) AS prev
            CROSS JOIN
                ( SELECT x, y, ROW_NUMBER() OVER (ORDER BY x ASC) as rn FROM df WHERE x >= parent.x and y is not null ) AS next
            WHERE
                prev.rn = next.rn
        )
    ) AS y
FROM
    df parent
x y
0.0 2.0
1.0 3.00000000000000000000
3.0 5.0000000000000000
5.0 7.0
6.0 4.0
8.0 5.0
9.0 null

fiddle

Upvotes: 2

Related Questions