Reputation: 93
I have a dataframe with a number of groups, each with a weight and a list of values that can be of arbitrary length, for example:
df = pl.DataFrame(
    {
        "Group": ["Group1", "Group2", "Group3"],
        "Weight": [100.0, 200.0, 300.0],
        "Vals": [[0.5, 0.5, 0.8], [0.5, 0.5, 0.8], [0.7, 0.9]],
    }
)
┌────────┬────────┬─────────────────┐
│ Group ┆ Weight ┆ Vals │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] │
╞════════╪════════╪═════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] │
└────────┴────────┴─────────────────┘
My goal is to calculate a 'Weighted' column, which is the product of each item in the values list and the value in the Weight column:
┌────────┬────────┬─────────────────┬─────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[i64] │
╞════════╪════════╪═════════════════╪═════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50, 50, 80] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100, 100, 160] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210, 270] │
└────────┴────────┴─────────────────┴─────────────────┘
I've tried a few different things:
df.with_columns(
    pl.col("Vals").list.eval(pl.element() * 3).alias("Weight1"),                  # Multiplying with a literal works
    pl.col("Vals").list.eval(pl.element() * pl.col("Weight")).alias("Weight2"),   # Does not work
    pl.col("Vals").list.eval(pl.element() * pl.col("Unknown")).alias("Weight3"),  # Unknown columns give the same result
    pl.col("Vals").list.eval(pl.col("Vals") * pl.col("Weight")).alias("Weight4"), # Same effect
    # pl.col("Vals") * 3 -> gives an error
)
┌────────┬────────┬────────────┬────────────┬──────────────┬──────────────┬────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weight1 ┆ Weight2 ┆ Weight3 ┆ Weight4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪════════════╪════════════╪══════════════╪══════════════╪════════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, ┆ [1.5, 1.5, ┆ [0.25, 0.25, ┆ [0.25, 0.25, ┆ [0.25, 0.25, 0.64] │
│ ┆ ┆ 0.8] ┆ 2.4] ┆ 0.64] ┆ 0.64] ┆ │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, ┆ [1.5, 1.5, ┆ [0.25, 0.25, ┆ [0.25, 0.25, ┆ [0.25, 0.25, 0.64] │
│ ┆ ┆ 0.8] ┆ 2.4] ┆ 0.64] ┆ 0.64] ┆ │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [2.1, 2.7] ┆ [0.49, 0.81] ┆ [0.49, 0.81] ┆ [0.49, 0.81] │
└────────┴────────┴────────────┴────────────┴──────────────┴──────────────┴────────────────────┘
Unless I'm not understanding it correctly, it seems you're unable to access columns outside of the list from within the eval function. Perhaps there is a way to use a list comprehension within the statement, but that doesn't really seem like a neat solution.
What would be the recommended approach here? Any help would be appreciated!
Upvotes: 6
Views: 3917
Reputation: 117400
Updated. Since version 1.10.0, arithmetic operations between lists and scalars are supported:
df.with_columns(
    Weighted = pl.col.Vals * pl.col.Weight
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210.0, 270.0] │
└────────┴────────┴─────────────────┴───────────────────────┘
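The expected output in the question has integer lists rather than floats; if that matters, the result can also be cast, as in this minimal tweak of the snippet above (not part of the original answer):
(
    df.with_columns(
        # cast the list[f64] result to list[i64] to match the desired output
        Weighted = (pl.col.Vals * pl.col.Weight).cast(pl.List(pl.Int64))
    )
)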
Outdated. Since version 1.8.0, arithmetic operations on lists are supported. You still need to convert the scalar column into a list column of the appropriate length with pl.Expr.repeat_by() and pl.Expr.list.len().
df.with_columns(
    Weighted = pl.col.Vals * pl.col.Weight.repeat_by(pl.col.Vals.list.len())
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210.0, 270.0] │
└────────┴────────┴─────────────────┴───────────────────────┘
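For intuition, this is roughly what the intermediate repeat_by() expression produces on its own (an illustrative peek, not part of the original answer):
df.select(pl.col.Weight.repeat_by(pl.col.Vals.list.len()))
# Weight becomes a list column with the same shape as Vals, so the
# element-wise list multiplication lines up row by row:
# [100.0, 100.0, 100.0], [200.0, 200.0, 200.0], [300.0, 300.0]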
Outdated. Unfortunately, operations on lists are not yet supported, but you can work around it using Struct and avoid exploding and imploding the data. You can create a list out of the Weight column and then get the result by multiplying Structs and converting back to List:
- repeat_by() to create a list of Weight values, using the list.len() of the list in the Vals column.
- list.to_struct() to convert each list to a Struct.
- struct.field() to get all the fields of the struct (using "*") as columns.
- concat_list() to concat the columns back into a list.
- list.drop_nulls() to drop nulls which might appear due to the different lengths of the lists.
- cast() to cast the resulting list of floats to a list of integers.
(
    df
    .with_columns(Weighted = pl.col.Weight.repeat_by(pl.col.Vals.list.len()))
    .with_columns(pl.col.Weighted.list.to_struct() * pl.col.Vals.list.to_struct())
    .with_columns(Weighted = pl.concat_list(pl.col.Weighted.struct.field("*")))
    .with_columns(pl.col.Weighted.list.drop_nulls().cast(pl.List(pl.Int64)))
)
┌────────┬────────┬─────────────────┬─────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[i64] │
╞════════╪════════╪═════════════════╪═════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50, 50, 80] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100, 100, 160] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210, 270] │
└────────┴────────┴─────────────────┴─────────────────┘
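A small aside on the drop_nulls() step, since it may look surprising: when lists of different lengths are converted to structs, the shorter ones end up with null fields, which then survive the multiplication and the concat_list(). A hypothetical two-row example (not from the answer, assuming the default list.to_struct() behaviour of sizing the struct from the first list):
demo = pl.DataFrame({"a": [[1.0, 2.0, 3.0], [4.0, 5.0]]})
demo.select(pl.col("a").list.to_struct())
# The second row's struct gets a null third field: {4.0, 5.0, null}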
Upvotes: 0
Reputation: 1054
You can solve this in two ways; to be honest, I don't know which one is better, as I haven't tested performance or RAM usage.
The first option is to put both columns in a struct and then apply a custom function to them (this approach is explained in the Polars documentation):
import polars as pl
import numpy as np

def weighted_list(ls, weight):
    return np.array(ls) * weight

df.with_columns(
    pl.struct("Weight", "Vals").map_elements(
        lambda x: weighted_list(x["Vals"], x["Weight"]),
        return_dtype=pl.List(pl.Float64)
    ).alias("Weighted")
)
The second option is to explode, multiply, and re-aggregate. In this case you have to be careful that the combination of your "Group" and "Weight" columns is unique: if you had, for example, two entries with Group3 and Weight 300.0, they would be merged into a single row by the group_by (a way around this is sketched after the code below).
(df.explode("Vals")
.with_columns((pl.col("Vals") * pl.col("Weight")).alias("Weighted"))
.group_by("Group", "Weight")
.agg("Vals", "Weighted")
)
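If the Group/Weight combination might not be unique, one hedged variant is to tag each row with a temporary index and group on that instead. This assumes a Polars version that has with_row_index() (older versions call it with_row_count()):
(df.with_row_index()
   .explode("Vals")
   .with_columns((pl.col("Vals") * pl.col("Weight")).alias("Weighted"))
   # group on the unique row index so duplicate Group/Weight rows stay separate
   .group_by("index", maintain_order=True)
   .agg(pl.col("Group").first(), pl.col("Weight").first(), "Vals", "Weighted")
   .drop("index")
)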
Upvotes: 2
Reputation: 3007
As of the latest version of Polars, this is now the correct syntax:
df = pl.DataFrame(
    {
        "Group": ["Group1", "Group2", "Group3"],
        "Weight": [100.0, 200.0, 300.0],
        "Vals": [[0.5, 0.5, 0.8], [0.5, 0.5, 0.8], [0.7, 0.9]],
    }
)
(df
    .explode('Vals')
    .with_columns(Weighted = pl.col('Weight') * pl.col('Vals'))
    .group_by('Group')
    .agg(
        pl.col('Weight').first(),
        pl.col('Vals'),
        pl.col('Weighted')
    )
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210.0, 270.0] │
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
└────────┴────────┴─────────────────┴───────────────────────┘
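Note that group_by does not guarantee the original row order, which is why Group3 appears first above. If the order matters, one small addition (not in the original answer) is to pass maintain_order=True:
(df
    .explode('Vals')
    .with_columns(Weighted = pl.col('Weight') * pl.col('Vals'))
    # keep groups in the order they first appear, at some performance cost
    .group_by('Group', maintain_order=True)
    .agg(
        pl.col('Weight').first(),
        pl.col('Vals'),
        pl.col('Weighted')
    )
)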
Upvotes: 3