Reputation: 181
What is the polars equivalent of pandas.pivot_table.reindex? I am trying to convert my pandas code to polars and I hit a snag.
Modified example from polars user guide
import polars as pl
import pandas as pd
df = pl.DataFrame(
{
"foo": ["A", "A", "B", "B", "C"],
"N": [1, 2, 2, 4, 2],
"bar": ["k", "l", "n", "n", "o"],
}
)
df_pl = df.pivot(index="foo", columns="bar",
values="N", aggregate_function="max")
print (df_pl)
shape: (3, 5)
┌─────┬──────┬──────┬──────┬──────┐
│ foo ┆ k ┆ l ┆ n ┆ o │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╪══════╪══════╡
│ A ┆ 1 ┆ 2 ┆ null ┆ null │
│ B ┆ null ┆ null ┆ 4 ┆ null │
│ C ┆ null ┆ null ┆ null ┆ 2 │
└─────┴──────┴──────┴──────┴──────┘
equivalent code of pandas for the above polars pivot output
df_pd = pd.pivot_table(df.to_pandas(),
values = "N",
index = "foo",
columns = ["bar"],
aggfunc= 'max',
sort=True)
print (df_pd)
bar k l n o
foo
A 1.0 2.0 NaN NaN
B NaN NaN 4.0 NaN
C NaN NaN NaN 2.0
Now, I would like to reindex the pivot table in certain order. As you can see, pandas added the new column with all NaN's as the data for column "m" doesn't exist in the original df.
reindex_list = ['k', 'm', 'l', 'n', 'o']
df_pd_reindexed = df_pd.reindex(reindex_list, axis = 1)
print (df_pd_reindexed)
bar k m l n o
foo
A 1.0 NaN 2.0 NaN NaN
B NaN NaN NaN 4.0 NaN
C NaN NaN NaN NaN 2.0
How to get similar output using polars?
I tried the df.select method to order the columns, but it doesn't help in this scenario as the column is missing and polars throws an error which is obvious. Unless, I need to add the missing columns manually and populate them with nulls and then call the select method. I am looking for a simpler method than adding the missing columns manually.
df_pl = df_pl.select(["foo"] + reindex_list)
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Python39\lib\site-packages\polars\dataframe\frame.py", line 8097, in select
return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
File "C:\Python39\lib\site-packages\polars\lazyframe\frame.py", line 1730, in collect
return wrap_df(ldf.collect())
polars.exceptions.ColumnNotFoundError: m
Error originated just after this operation:
DF ["foo", "k", "l", "n"]; PROJECT */5 COLUMNS; SELECTION: "None"
Any guidance is much appreciated. Thanks
Upvotes: 4
Views: 233
Reputation: 2961
One possible approach would be to add any needed new blank (i.e., all null
) columns using pl.DataFrame.with_columns
chained to a pl.DataFrame.select
for enforcing specified columns order:
import polars as pl
df = pl.DataFrame(
{
"foo": ["A", "A", "B", "B", "C"],
"N": [1, 2, 2, 4, 2],
"bar": ["k", "l", "n", "n", "o"],
}
)
print(df)
df_pl = df.pivot(
index="foo", columns="bar", values="N", aggregate_function="max"
)
print(df_pl)
reindex_list = ["k", "m", "l", "n", "o"]
df_pl_reindexed = df_pl.with_columns(
pl.lit(None).alias(c)
for c in list(set(reindex_list) - set(df_pl.columns))
).select(["foo"] + reindex_list)
print(df_pl_reindexed)
gives:
shape: (5, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ A ┆ 1 ┆ k │
│ A ┆ 2 ┆ l │
│ B ┆ 2 ┆ n │
│ B ┆ 4 ┆ n │
│ C ┆ 2 ┆ o │
└─────┴─────┴─────┘
shape: (3, 5)
┌─────┬──────┬──────┬──────┬──────┐
│ foo ┆ k ┆ l ┆ n ┆ o │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╪══════╪══════╡
│ A ┆ 1 ┆ 2 ┆ null ┆ null │
│ B ┆ null ┆ null ┆ 4 ┆ null │
│ C ┆ null ┆ null ┆ null ┆ 2 │
└─────┴──────┴──────┴──────┴──────┘
shape: (3, 6)
┌─────┬──────┬──────┬──────┬──────┬──────┐
│ foo ┆ k ┆ m ┆ l ┆ n ┆ o │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ null ┆ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╪══════╪══════╪══════╡
│ A ┆ 1 ┆ null ┆ 2 ┆ null ┆ null │
│ B ┆ null ┆ null ┆ null ┆ 4 ┆ null │
│ C ┆ null ┆ null ┆ null ┆ null ┆ 2 │
└─────┴──────┴──────┴──────┴──────┴──────┘
Upvotes: 1