sexyspecs
sexyspecs

Reputation: 121

Multiple polars DataFrames with same shape: update one will non-null values of the other

I have a 100x100 DataFrame (representing a Gaussian KDE) that represents a small sample of data so there are many holes. I have another 100x100 df that represents a larger sample of data with no holes. I would like to combine these dfs into a single 100x100 df but with a preference for the smaller df. In other words, I would like to fill in all the missing gaps in the smaller df with data from the larger df.

I can think of two ways of doing this:

  1. Start with the smaller df and replace all null values with the equivalent value from the larger
  2. Start with the larger df and overlay all non-null values of the smaller df

Is there an efficient way to do this that doesn't involve going cell-by-cell?

Upvotes: 0

Views: 644

Answers (1)

user20557510
user20557510

Reputation:

You can use the coalesce function in Polars, which also handles cases where you are coalescing more than two DataFrames.

import polars as pl
from datetime import date

small_sample = pl.DataFrame({
    'col_bool': [False, None, None],
    'col_float': [1.0, None, None],
    'col_dt': [date(2022, 1, 1), None, None],
})
small_sample


medium_sample = pl.DataFrame({
    'col_bool': [True, True, None],
    'col_float': [10.0, 20.0, None],
    'col_dt': [date(2023, 1, 1), date(2023, 2, 1), None],
})
medium_sample


large_sample = pl.DataFrame({
    'col_bool': [True, True, True],
    'col_float': [100.0, 200.0, 300.0],
    'col_dt': [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1)],
})
large_sample
>>> small_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ false    ┆ 1.0       ┆ 2022-01-01 │
│ null     ┆ null      ┆ null       │
│ null     ┆ null      ┆ null       │
└──────────┴───────────┴────────────┘

>>> medium_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ true     ┆ 10.0      ┆ 2023-01-01 │
│ true     ┆ 20.0      ┆ 2023-02-01 │
│ null     ┆ null      ┆ null       │
└──────────┴───────────┴────────────┘

>>> large_sample
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ true     ┆ 100.0     ┆ 2024-01-01 │
│ true     ┆ 200.0     ┆ 2024-02-01 │
│ true     ┆ 300.0     ┆ 2024-03-01 │
└──────────┴───────────┴────────────┘
pl.select(
    pl.coalesce(small, medium, large)
    for small, medium, large in zip(small_sample, medium_sample, large_sample)
)
shape: (3, 3)
┌──────────┬───────────┬────────────┐
│ col_bool ┆ col_float ┆ col_dt     │
│ ---      ┆ ---       ┆ ---        │
│ bool     ┆ f64       ┆ date       │
╞══════════╪═══════════╪════════════╡
│ false    ┆ 1.0       ┆ 2022-01-01 │
│ true     ┆ 20.0      ┆ 2023-02-01 │
│ true     ┆ 300.0     ┆ 2024-03-01 │
└──────────┴───────────┴────────────┘

Upvotes: 2

Related Questions