Olivier_s_j
Olivier_s_j

Reputation: 5202

Duplicate null columns created during pivot in polars

I have this example dataframe in polars:

df_example = pl.DataFrame(
    {
        "DATE": ["2024-11-11", "2024-11-11", "2024-11-12", "2024-11-12", "2024-11-13"],
        "A": [None, None, "option1", "option2", None],
        "B": [None, None, "YES", "YES", "NO"],
    }
)

Which looks like this:

DATE A B
0 2024-11-11
1 2024-11-11
2 2024-11-12 option1 YES
3 2024-11-12 option2 YES
4 2024-11-13 NO

As you can see this is a long format dataframe. I want to have it in a wide format, meaning that I want the DATE to be unique per row and for each other column several columns have to be created. What I want to achieve is:

DATE A_option1 A_option2 B_YES B_NO
2024-11-11 Null Null Null Null
2024-11-12 True True True Null
2024-11-13 Null Null Null True

I have tried doing the following:

df_example.pivot(
    index="DATE", on=["A", "B"], values=["A", "B"], aggregate_function="first"
)

However, I get this error:

DuplicateError: column with name 'null' has more than one occurrence

Which is logical, as it tries to create a column for the Null values in columns A, and a column for the Null values in column B.

I am looking for a clean solution to this problem. I know I can impute the nulls per column with something unique and then do the pivot. Or by pivoting per column and then dropping the Null columns. However, this will create unnecessary columns. I want something more elegant.

Upvotes: 2

Views: 86

Answers (2)

roman
roman

Reputation: 117540

Do this for each column separately and then

pl.concat([
    df_example
    .pivot(
        index="DATE", values=col, on=col, 
        aggregate_function=pl.element().drop_nulls().is_not_null().first()
    )
    .drop('null')
    .select(
        pl.col("DATE"),
        pl.exclude("DATE").name.prefix(f"{col}_")
    )
    for col in df_example.columns if col != "DATE"
], how="align")
shape: (3, 5)
┌────────────┬───────────┬───────────┬───────┬──────┐
│ DATE       ┆ A_option1 ┆ A_option2 ┆ B_YES ┆ B_NO │
│ ---        ┆ ---       ┆ ---       ┆ ---   ┆ ---  │
│ str        ┆ bool      ┆ bool      ┆ bool  ┆ bool │
╞════════════╪═══════════╪═══════════╪═══════╪══════╡
│ 2024-11-11 ┆ null      ┆ null      ┆ null  ┆ null │
│ 2024-11-12 ┆ true      ┆ true      ┆ true  ┆ null │
│ 2024-11-13 ┆ null      ┆ null      ┆ null  ┆ true │
└────────────┴───────────┴───────────┴───────┴──────┘

alternatively

(
    df_example
    .unpivot(index="DATE")
    .select(pl.col.DATE, value = pl.col.variable + "_" + pl.col.value)
    .unique(maintain_order=True)
    .pivot(
        index="DATE", on="value", values="value",
        aggregate_function=pl.element().is_not_null().any()
    )
    .drop("null")
)
shape: (3, 5)
┌────────────┬───────────┬───────────┬───────┬──────┐
│ DATE       ┆ A_option1 ┆ A_option2 ┆ B_YES ┆ B_NO │
│ ---        ┆ ---       ┆ ---       ┆ ---   ┆ ---  │
│ str        ┆ bool      ┆ bool      ┆ bool  ┆ bool │
╞════════════╪═══════════╪═══════════╪═══════╪══════╡
│ 2024-11-11 ┆ null      ┆ null      ┆ null  ┆ null │
│ 2024-11-12 ┆ true      ┆ true      ┆ true  ┆ null │
│ 2024-11-13 ┆ null      ┆ null      ┆ null  ┆ true │
└────────────┴───────────┴───────────┴───────┴──────┘

Upvotes: 0

Olivier_s_j
Olivier_s_j

Reputation: 5202

I ended up with:

(
    df_example.pipe(
        lambda df: df.group_by("DATE").agg(
            [
                pl.col(col).eq(val).any().alias(f"{col}_{val}")
                for col in df.select(pl.exclude("DATE")).columns
                for val in df.get_column(col).unique().drop_nulls()
            ]
        )
    ).sort("DATE")
)

Upvotes: 1

Related Questions