HumpbackWhale194
HumpbackWhale194

Reputation: 418

With Python Polars, how to compare two frame (like with `==`), but while returning True/False on comparisons with null

With python polars, how can I compare two dataframes (like with ==), and get a comparison result per-cell, but while returning True/False on comparisons with null. By default, doing df1 == df2 results in null being in any cells where either df1 or df2 contains a null.

For example:

df1 = pl.DataFrame(
    {
        "a": [1, 2, 3, None, 5],
        "b": [5, 4, 3, 2, None],
    }
)


df2 = pl.DataFrame(
    {
        "a": [1, 2, 3, 1, 5],
        "b": [5, 4, 30, 2, None],
    }
)


print(f"df1: {df1}")
print(f"df2: {df2}")

print(f"df1 == df2: {df1 == df2}")

Results in:

df1: shape: (5, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 5    │
│ 2    ┆ 4    │
│ 3    ┆ 3    │
│ null ┆ 2    │
│ 5    ┆ null │
└──────┴──────┘
df2: shape: (5, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ 5    │
│ 2   ┆ 4    │
│ 3   ┆ 30   │
│ 1   ┆ 2    │
│ 5   ┆ null │
└─────┴──────┘
df1 == df2: shape: (5, 2)
┌──────┬───────┐
│ a    ┆ b     │
│ ---  ┆ ---   │
│ bool ┆ bool  │
╞══════╪═══════╡
│ true ┆ true  │
│ true ┆ true  │
│ true ┆ false │
│ null ┆ true  │
│ true ┆ null  │
└──────┴───────┘

However, I'm trying to determine how to get the following result:

df1 compared to df2: shape: (5, 2)
┌──────┬───────┐
│ a    ┆ b     │
│ ---  ┆ ---   │
│ bool ┆ bool  │
╞══════╪═══════╡
│ true ┆ true  │
│ true ┆ true  │
│ true ┆ false │
│false ┆ true  │ <- false b/c cell is null in one DF, and a value in the other
│ true ┆ true  │ <- bottom-right cell is true
└──────┴───────┘     because df1 and df2 have the same value (null)

Upvotes: 4

Views: 294

Answers (3)

Cameron Riddell
Cameron Riddell

Reputation: 13417

If you encounter the scenario where you have DataFrames with different columns and want to operate across the union of their columns you can extend the suggestion by @Roman Pekar to:

import polars as pl

df1 = pl.DataFrame(
    {
        "a": [1, 2, 3, None, 5],
        "b": [5, 4, 3, 2, None],
        "d": [None, 10, 20, 1, None],
    }
)


df2 = pl.DataFrame(
    {
        "a": [1, 2, 3, 1, 5],
        "b": [5, 4, 30, 2, None],
        "c": [5, None, 40, 0, -1],
    }
)

Then we can adapt the great solutions suggested by @Hericks and @Roman Pekar.

# sorting is not necessary, but creates consistent column ordering across runs
shared_cols = sorted(set(df1.columns) & set(df2.columns))
uniq_cols = sorted(set(df1.columns) ^ set(df2.columns))
suffix = '__b'

# should also perform some height checking

# @Hericks approach
print(
    pl.concat([df1, df2.rename(lambda c: f'{c}{suffix}')], how='horizontal')
    .select(
        *(pl.col(c).eq_missing(pl.col(f'{c}{suffix}')) for c in shared_cols),
        *(pl.repeat(False, n=len(df1)).alias(c) for c in uniq_cols)
    )
)
# shape: (5, 4)
# ┌───────┬───────┬───────┬───────┐
# │ a     ┆ b     ┆ c     ┆ d     │
# │ ---   ┆ ---   ┆ ---   ┆ ---   │
# │ bool  ┆ bool  ┆ bool  ┆ bool  │
# ╞═══════╪═══════╪═══════╪═══════╡
# │ true  ┆ true  ┆ false ┆ false │
# │ true  ┆ true  ┆ false ┆ false │
# │ true  ┆ false ┆ false ┆ false │
# │ false ┆ true  ┆ false ┆ false │
# │ true  ┆ true  ┆ false ┆ false │
# └───────┴───────┴───────┴───────┘

# @Roman Pekars approach
print(
    df1.select(pl.col(c).eq_missing(df2[c]) for c in shared_cols)
    .with_columns(
        pl.repeat(False, n=len(df1)).alias(c) for c in uniq_cols
    )
)
# shape: (5, 4)
# ┌───────┬───────┬───────┬───────┐
# │ a     ┆ b     ┆ c     ┆ d     │
# │ ---   ┆ ---   ┆ ---   ┆ ---   │
# │ bool  ┆ bool  ┆ bool  ┆ bool  │
# ╞═══════╪═══════╪═══════╪═══════╡
# │ true  ┆ true  ┆ false ┆ false │
# │ true  ┆ true  ┆ false ┆ false │
# │ true  ┆ false ┆ false ┆ false │
# │ false ┆ true  ┆ false ┆ false │
# │ true  ┆ true  ┆ false ┆ false │
# └───────┴───────┴───────┴───────┘

Upvotes: 1

Hericks
Hericks

Reputation: 10049

You can easily check how polars implements pl.DataFrame.__eq__ under the hood. A helper function similar to the polars implementation, but relying on pl.Expr.eq_missing instead (as already mentioned in the answer above), could look as follows.

def _compare_to_other_df_missing(df, other):
    """
    Compare a DataFrame with another DataFrame respecting `None == None`.

    This differs from default comparison where null values are propagated.
    """
    if df.columns != other.columns:
        msg = "DataFrame columns do not match"
        raise ValueError(msg)
    if df.shape != other.shape:
        msg = "DataFrame dimensions do not match"
        raise ValueError(msg)

    suffix = "__POLARS_CMP_OTHER"
    other_renamed = other.select(pl.all().name.suffix(suffix))
    combined = pl.concat([df, other_renamed], how="horizontal")
    expr = [pl.col(n).eq_missing(pl.col(f"{n}{suffix}")) for n in df.columns]

    return combined.select(expr)
>>> _compare_to_other_df_missing(df1, df2)
shape: (5, 2)
┌───────┬───────┐
│ a     ┆ b     │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ true  ┆ true  │
│ true  ┆ true  │
│ true  ┆ false │
│ false ┆ true  │
│ true  ┆ true  │
└───────┴───────┘

Upvotes: 2

roman
roman

Reputation: 117410

It's a little more verbose that df1 == df2, but to have pure polars solution you can use eq_missing(). In addition, you can use the fact that df[col] returns column as pl.Series:

df1.select(pl.col(c).eq_missing(df2[c]) for c in df1.columns)

┌───────┬───────┐
│ a     ┆ b     │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ true  ┆ true  │
│ true  ┆ true  │
│ true  ┆ false │
│ false ┆ true  │
│ true  ┆ true  │
└───────┴───────┘

Upvotes: 2

Related Questions