Reputation: 83
I've noticed some unexpected behavior with the interpolate_by expression and I'm not sure what is going on.
df = pl.DataFrame({
'a': [1, 2, 3, 4, 5],
'b': [4, 5, None, 7, 8]
})
df = df.with_columns(interpolate = pl.col('b').interpolate_by('a'))
print(df)
results in this:
┌─────┬──────┬─────────────┐
│ a ┆ b ┆ interpolate │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞═════╪══════╪═════════════╡
│ 1 ┆ 4 ┆ 4.0 │
│ 2 ┆ 5 ┆ 5.0 │
│ 3 ┆ null ┆ 6.0 │
│ 4 ┆ 7 ┆ 7.0 │
│ 5 ┆ 8 ┆ 8.0 │
└─────┴──────┴─────────────┘
which is correct. However this:
df = pl.DataFrame({
'a': [1, 2, 3, 4, 5],
'b': [4, 5, 6, 7, None]
})
df = df.with_columns(interpolate = pl.col('b').interpolate_by('a'))
print(df)
results in this:
shape: (5, 3)
┌─────┬──────┬─────────────┐
│ a ┆ b ┆ interpolate │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞═════╪══════╪═════════════╡
│ 1 ┆ 4 ┆ 4.0 │
│ 2 ┆ 5 ┆ 5.0 │
│ 3 ┆ 6 ┆ 6.0 │
│ 4 ┆ 7 ┆ 7.0 │
│ 5 ┆ null ┆ null │
└─────┴──────┴─────────────┘
which is not correct. There is still plenty of data to perform a linear interpolation on column B using the data in column A. Am I missing something here and don't understand how this is supposed to work or is this a bug?
Upvotes: 2
Views: 64
Reputation: 18691
Just a quick pedantic point, polars isn't failing to interpolate. You're wanting it to extrapolate.
If you want it to take the last trend and apply it then there's not a built in way to do that but you can do it yourself with expressions with forward_fill
and shift
.
df.with_columns(
pl.col("b").fill_null(
((pl.col("b")-pl.col("b").shift())/(pl.col("a")-pl.col("a").shift()))
.forward_fill() *
(pl.col("a")-pl.col("a").shift()) + pl.col("b").shift()
)
)
This extrapolation is interchangeable with your first example's interpolate since there's a constant trend but it'd give you a much different answer if the trend was differing over time (see note at the end).
You can turn this into a function for reuse with
def extrapolate_by(col_to_fill:str|pl.Expr, index_col:str|pl.Expr) -> pl.Expr:
if isinstance(col_to_fill, str):
col_to_fill=pl.col(col_to_fill)
if isinstance(index_col, str):
index_col=pl.col(index_col)
return col_to_fill.fill_null(
((col_to_fill-col_to_fill.shift())/(index_col-index_col.shift()))
.forward_fill() *
((index_col-index_col.shift())) + col_to_fill.shift()
)
and then you can simple do df.with_columns(extrapolate_by("b","a"))
or if you don't mind not having type hints, you can monkey patch it to pl.Expr
and its usage will be chainable
pl.Expr.extrapolate_by = extrapolate_by
df.with_columns(pl.col("b").extrapolate_by("a"))
As is, this only works forward, if you want to fill missing values at the beginning, you'd have to inverse the shift direction and fill direction.
Note on difference between extrapolate and interpolate.
df = pl.DataFrame({
'a': [1, 2, 3, 4, 5],
'b': [4, 6, None, 20, 22]
})
df.with_columns(
interpolate=pl.col("b").interpolate_by("a"),
extrapolate=pl.col("b").extrapolate_by("a")
)
shape: (5, 4)
┌─────┬──────┬─────────────┬─────────────┐
│ a ┆ b ┆ interpolate ┆ extrapolate │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 ┆ f64 │
╞═════╪══════╪═════════════╪═════════════╡
│ 1 ┆ 4 ┆ 4.0 ┆ 4.0 │
│ 2 ┆ 6 ┆ 6.0 ┆ 6.0 │
│ 3 ┆ null ┆ 13.0 ┆ 8.0 │
│ 4 ┆ 20 ┆ 20.0 ┆ 20.0 │
│ 5 ┆ 22 ┆ 22.0 ┆ 22.0 │
└─────┴──────┴─────────────┴─────────────┘
When you interpolate it takes the known values on either side of the missing value to deduce what the missing value should have been. Extrapolating only takes old information to apply it to the future. In this example the trend changes so there's a big difference between extrapolating and interpolating so just be aware that these aren't always interchangeable.
Upvotes: 0