Reputation: 5202
I have this example dataframe in polars:
df_example = pl.DataFrame(
    {
        "DATE": ["2024-11-11", "2024-11-11", "2024-11-12", "2024-11-12", "2024-11-13"],
        "A": [None, None, "option1", "option2", None],
        "B": [None, None, "YES", "YES", "NO"],
    }
)
Which looks like this:
| | DATE | A | B |
|---|---|---|---|
| 0 | 2024-11-11 | | |
| 1 | 2024-11-11 | | |
| 2 | 2024-11-12 | option1 | YES |
| 3 | 2024-11-12 | option2 | YES |
| 4 | 2024-11-13 | | NO |
As you can see, this is a long-format dataframe. I want it in a wide format, meaning the DATE should be unique per row and, for each other column, several columns have to be created. What I want to achieve is:
| DATE | A_option1 | A_option2 | B_YES | B_NO |
|---|---|---|---|---|
| 2024-11-11 | Null | Null | Null | Null |
| 2024-11-12 | True | True | True | Null |
| 2024-11-13 | Null | Null | Null | True |
I have tried doing the following:
df_example.pivot(
    index="DATE", on=["A", "B"], values=["A", "B"], aggregate_function="first"
)
However, I get this error:
DuplicateError: column with name 'null' has more than one occurrence
Which is logical, as it tries to create a column for the null values in column A and another column for the null values in column B.
I am looking for a clean solution to this problem. I know I can impute the nulls per column with something unique and then do the pivot, or pivot per column and then drop the null columns. However, this creates unnecessary columns. I want something more elegant.
Upvotes: 2
Views: 86
Reputation: 117540
Use:

- pl.DataFrame.pivot() with an aggregate_function built from pl.element(), pl.Expr.is_not_null() and pl.Expr.any() to convert values to true / false.
- pl.DataFrame.drop() to drop columns with only nulls.
- pl.Expr.name.prefix() and pl.exclude() to rename the columns.
- Do this for each column separately and then pl.concat() the resulting dataframes.

pl.concat([
    df_example
    .pivot(
        index="DATE", values=col, on=col,
        aggregate_function=pl.element().drop_nulls().is_not_null().first()
    )
    .drop("null")
    .select(
        pl.col("DATE"),
        pl.exclude("DATE").name.prefix(f"{col}_")
    )
    for col in df_example.columns if col != "DATE"
], how="align")
shape: (3, 5)
┌────────────┬───────────┬───────────┬───────┬──────┐
│ DATE ┆ A_option1 ┆ A_option2 ┆ B_YES ┆ B_NO │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ bool ┆ bool ┆ bool │
╞════════════╪═══════════╪═══════════╪═══════╪══════╡
│ 2024-11-11 ┆ null ┆ null ┆ null ┆ null │
│ 2024-11-12 ┆ true ┆ true ┆ true ┆ null │
│ 2024-11-13 ┆ null ┆ null ┆ null ┆ true │
└────────────┴───────────┴───────────┴───────┴──────┘
Alternatively, use:

- pl.DataFrame.unpivot() to convert columns to rows.
- pl.DataFrame.unique() to drop duplicate empty rows.
- pl.DataFrame.pivot() to pivot it back.

(
    df_example
    .unpivot(index="DATE")
    .select(pl.col.DATE, value=pl.col.variable + "_" + pl.col.value)
    .unique(maintain_order=True)
    .pivot(
        index="DATE", on="value", values="value",
        aggregate_function=pl.element().is_not_null().any()
    )
    .drop("null")
)
shape: (3, 5)
┌────────────┬───────────┬───────────┬───────┬──────┐
│ DATE ┆ A_option1 ┆ A_option2 ┆ B_YES ┆ B_NO │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ bool ┆ bool ┆ bool │
╞════════════╪═══════════╪═══════════╪═══════╪══════╡
│ 2024-11-11 ┆ null ┆ null ┆ null ┆ null │
│ 2024-11-12 ┆ true ┆ true ┆ true ┆ null │
│ 2024-11-13 ┆ null ┆ null ┆ null ┆ true │
└────────────┴───────────┴───────────┴───────┴──────┘
Upvotes: 0
Reputation: 5202
I ended up with:
(
    df_example.pipe(
        lambda df: df.group_by("DATE").agg(
            [
                pl.col(col).eq(val).any().alias(f"{col}_{val}")
                for col in df.select(pl.exclude("DATE")).columns
                for val in df.get_column(col).unique().drop_nulls()
            ]
        )
    ).sort("DATE")
)
Upvotes: 1