Reputation: 418
With python polars, how can I compare two dataframes (like with ==
), and get a comparison result per-cell, but while returning True
/False
on comparisons with null. By default, doing df1 == df2
results in null
being in any cells where either df1
or df2
contains a null
.
For example:
df1 = pl.DataFrame(
{
"a": [1, 2, 3, None, 5],
"b": [5, 4, 3, 2, None],
}
)
df2 = pl.DataFrame(
{
"a": [1, 2, 3, 1, 5],
"b": [5, 4, 30, 2, None],
}
)
print(f"df1: {df1}")
print(f"df2: {df2}")
print(f"df1 == df2: {df1 == df2}")
Results in:
df1: shape: (5, 2)
┌──────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 5 │
│ 2 ┆ 4 │
│ 3 ┆ 3 │
│ null ┆ 2 │
│ 5 ┆ null │
└──────┴──────┘
df2: shape: (5, 2)
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 5 │
│ 2 ┆ 4 │
│ 3 ┆ 30 │
│ 1 ┆ 2 │
│ 5 ┆ null │
└─────┴──────┘
df1 == df2: shape: (5, 2)
┌──────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ bool ┆ bool │
╞══════╪═══════╡
│ true ┆ true │
│ true ┆ true │
│ true ┆ false │
│ null ┆ true │
│ true ┆ null │
└──────┴───────┘
However, I'm trying to determine how to get the following result:
df1 compared to df2: shape: (5, 2)
┌──────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ bool ┆ bool │
╞══════╪═══════╡
│ true ┆ true │
│ true ┆ true │
│ true ┆ false │
│false ┆ true │ <- false b/c cell is null in one DF, and a value in the other
│ true ┆ true │ <- bottom-right cell is true
└──────┴───────┘ because df1 and df2 have the same value (null)
Upvotes: 4
Views: 294
Reputation: 13417
If you encounter the scenario where you have DataFrames with different columns and want to operate across the union of their columns you can extend the suggestion by @Roman Pekar to:
import polars as pl
df1 = pl.DataFrame(
{
"a": [1, 2, 3, None, 5],
"b": [5, 4, 3, 2, None],
"d": [None, 10, 20, 1, None],
}
)
df2 = pl.DataFrame(
{
"a": [1, 2, 3, 1, 5],
"b": [5, 4, 30, 2, None],
"c": [5, None, 40, 0, -1],
}
)
Then we can adapt the great solutions suggested by @Hericks and @Roman Pekar.
# sorting is not necessary, but creates consistent column ordering across runs
shared_cols = sorted(set(df1.columns) & set(df2.columns))
uniq_cols = sorted(set(df1.columns) ^ set(df2.columns))
suffix = '__b'
# should also perform some height checking
# @Hericks approach
print(
pl.concat([df1, df2.rename(lambda c: f'{c}{suffix}')], how='horizontal')
.select(
*(pl.col(c).eq_missing(pl.col(f'{c}{suffix}')) for c in shared_cols),
*(pl.repeat(False, n=len(df1)).alias(c) for c in uniq_cols)
)
)
# shape: (5, 4)
# ┌───────┬───────┬───────┬───────┐
# │ a ┆ b ┆ c ┆ d │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ bool ┆ bool ┆ bool ┆ bool │
# ╞═══════╪═══════╪═══════╪═══════╡
# │ true ┆ true ┆ false ┆ false │
# │ true ┆ true ┆ false ┆ false │
# │ true ┆ false ┆ false ┆ false │
# │ false ┆ true ┆ false ┆ false │
# │ true ┆ true ┆ false ┆ false │
# └───────┴───────┴───────┴───────┘
# @Roman Pekars approach
print(
df1.select(pl.col(c).eq_missing(df2[c]) for c in shared_cols)
.with_columns(
pl.repeat(False, n=len(df1)).alias(c) for c in uniq_cols
)
)
# shape: (5, 4)
# ┌───────┬───────┬───────┬───────┐
# │ a ┆ b ┆ c ┆ d │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ bool ┆ bool ┆ bool ┆ bool │
# ╞═══════╪═══════╪═══════╪═══════╡
# │ true ┆ true ┆ false ┆ false │
# │ true ┆ true ┆ false ┆ false │
# │ true ┆ false ┆ false ┆ false │
# │ false ┆ true ┆ false ┆ false │
# │ true ┆ true ┆ false ┆ false │
# └───────┴───────┴───────┴───────┘
Upvotes: 1
Reputation: 10049
You can easily check how polars implements pl.DataFrame.__eq__
under the hood. A helper function similar to the polars implementation, but relying on pl.Expr.eq_missing
instead (as already mentioned in the answer above), could look as follows.
def _compare_to_other_df_missing(df, other):
"""
Compare a DataFrame with another DataFrame respecting `None == None`.
This differs from default comparison where null values are propagated.
"""
if df.columns != other.columns:
msg = "DataFrame columns do not match"
raise ValueError(msg)
if df.shape != other.shape:
msg = "DataFrame dimensions do not match"
raise ValueError(msg)
suffix = "__POLARS_CMP_OTHER"
other_renamed = other.select(pl.all().name.suffix(suffix))
combined = pl.concat([df, other_renamed], how="horizontal")
expr = [pl.col(n).eq_missing(pl.col(f"{n}{suffix}")) for n in df.columns]
return combined.select(expr)
>>> _compare_to_other_df_missing(df1, df2)
shape: (5, 2)
┌───────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ bool ┆ bool │
╞═══════╪═══════╡
│ true ┆ true │
│ true ┆ true │
│ true ┆ false │
│ false ┆ true │
│ true ┆ true │
└───────┴───────┘
Upvotes: 2
Reputation: 117410
It's a little more verbose that df1 == df2
, but to have pure polars solution you can use eq_missing()
. In addition, you can use the fact that df[col]
returns column as pl.Series
:
df1.select(pl.col(c).eq_missing(df2[c]) for c in df1.columns)
┌───────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ bool ┆ bool │
╞═══════╪═══════╡
│ true ┆ true │
│ true ┆ true │
│ true ┆ false │
│ false ┆ true │
│ true ┆ true │
└───────┴───────┘
Upvotes: 2