Reputation: 381
I have Polars dataframe with a bunch of columns I need to find the column with, for example, the largest sum.
The below snippet sums all of the columns:
df = pl.DataFrame(
{
"a": [0, 1, 3, 4],
"b": [0, 0, 0, 0],
"c": [1, 0, 1, 0],
}
)
max_col = df.select(pl.col(df.columns).sum())
shape: (1, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 8 ┆ 0 ┆ 2 │
└─────┴─────┴─────┘
But I'm missing the last step of selecting the column with the largest value?
Upvotes: 3
Views: 2111
Reputation: 1914
One more option is to do a sum
and a transpose
.
The transpose
is an expensive operation, but here we use it on the already aggregated dataframe so the impact should not be big.
As an advantage, it makes for a short code:
df.select(
df.sum().transpose(include_header=True)
.sort('column_0', descending=True).head(1)[0,0]
)
# Result
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
│ 1 │
│ 3 │
│ 4 │
└─────┘
Upvotes: 1
Reputation: 5503
You can use polars.DataFrame.row
to get the column sums as a dictionary (row(0, named=True)
). Then use max
to find the column with the highest sum.
df = pl.DataFrame(
{
"a": [0, 1, 3, 4],
"b": [0, 0, 0, 0],
"c": [1, 0, 1, 0],
}
)
col_sum_dict = df.select(pl.all().sum()).row(0, named=True)
res = df.select(
[col for col, col_sum in col_sum_dict.items()
if col_sum == max(col_sum_dict.values())]
)
Output:
>>> res
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
│ 1 │
│ 3 │
│ 4 │
└─────┘
>>> col_sum_dict
{'a': 8, 'b': 0, 'c': 2}
Note that this approach works if there is a tie, i.e. multiple columns with the highest sum. For example:
df = pl.DataFrame(
{
"a": [0, 1, 3, 4],
"same_as_a": [0, 1, 3, 4],
"b": [0, 0, 0, 0],
"c": [1, 0, 1, 0],
}
)
results in
>>> res
shape: (4, 2)
┌─────┬───────────┐
│ a ┆ same_as_a │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═══════════╡
│ 0 ┆ 0 │
│ 1 ┆ 1 │
│ 3 ┆ 3 │
│ 4 ┆ 4 │
└─────┴───────────┘
Upvotes: 1
Reputation: 18691
I would do this as a unpivot/filter.
df \
.select(pl.all().sum()) \
.unpivot() \
.filter(pl.col('value')==pl.col('value').max())
If you want the original shape then a single chain is a bit tougher. I'd just do it like this instead.
allcalc=df \
.select(pl.all().sum())
allcalc.select(allcalc.unpivot().filter(pl.col('value')==pl.col('value').max()) \
.get_column('variable').to_list())
The above works if there is a tie, for instance if you have:
df=pl.DataFrame(
{
"a": [0, 1, 3, 4],
"b": [0, 0, 0, 0],
"c": [1, 0, 1, 6],
}
)
then you'll get 'a' and 'c' in either case.
Upvotes: 0