scotsman60
scotsman60

Reputation: 381

How to select the column from a Polars Dataframe that has the largest sum?

I have Polars dataframe with a bunch of columns I need to find the column with, for example, the largest sum.

The below snippet sums all of the columns:

df = pl.DataFrame(
    {
        "a": [0, 1, 3, 4],
        "b": [0, 0, 0, 0],
        "c": [1, 0, 1, 0],
    }
)

max_col = df.select(pl.col(df.columns).sum())  
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 8   ┆ 0   ┆ 2   │
└─────┴─────┴─────┘

But I'm missing the last step of selecting the column with the largest value?

Upvotes: 3

Views: 2111

Answers (3)

Luca
Luca

Reputation: 1914

One more option is to do a sum and a transpose. The transpose is an expensive operation, but here we use it on the already aggregated dataframe so the impact should not be big.

As an advantage, it makes for a short code:

df.select(
    df.sum().transpose(include_header=True)
    .sort('column_0', descending=True).head(1)[0,0]
    )

# Result
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 1   │
│ 3   │
│ 4   │
└─────┘

Upvotes: 1

Rodalm
Rodalm

Reputation: 5503

You can use polars.DataFrame.row to get the column sums as a dictionary (row(0, named=True)). Then use max to find the column with the highest sum.

df = pl.DataFrame(
    {
        "a": [0, 1, 3, 4],
        "b": [0, 0, 0, 0],
        "c": [1, 0, 1, 0],
    }
)

col_sum_dict = df.select(pl.all().sum()).row(0, named=True)
res = df.select(
    [col for col, col_sum in col_sum_dict.items() 
     if col_sum == max(col_sum_dict.values())]
)


Output:

>>> res

shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 1   │
│ 3   │
│ 4   │
└─────┘

>>> col_sum_dict
{'a': 8, 'b': 0, 'c': 2}

Note that this approach works if there is a tie, i.e. multiple columns with the highest sum. For example:

df = pl.DataFrame(
    {
        "a": [0, 1, 3, 4],
        "same_as_a": [0, 1, 3, 4],
        "b": [0, 0, 0, 0],
        "c": [1, 0, 1, 0],
    }
)

results in

>>> res

shape: (4, 2)
┌─────┬───────────┐
│ a   ┆ same_as_a │
│ --- ┆ ---       │
│ i64 ┆ i64       │
╞═════╪═══════════╡
│ 0   ┆ 0         │
│ 1   ┆ 1         │
│ 3   ┆ 3         │
│ 4   ┆ 4         │
└─────┴───────────┘

Upvotes: 1

Dean MacGregor
Dean MacGregor

Reputation: 18691

I would do this as a unpivot/filter.

  df \
    .select(pl.all().sum()) \
    .unpivot() \
    .filter(pl.col('value')==pl.col('value').max())

If you want the original shape then a single chain is a bit tougher. I'd just do it like this instead.

allcalc=df \
    .select(pl.all().sum())
allcalc.select(allcalc.unpivot().filter(pl.col('value')==pl.col('value').max()) \
       .get_column('variable').to_list())

The above works if there is a tie, for instance if you have:

df=pl.DataFrame(
    {
        "a": [0, 1, 3, 4],
        "b": [0, 0, 0, 0],
        "c": [1, 0, 1, 6],
    }
)

then you'll get 'a' and 'c' in either case.

Upvotes: 0

Related Questions