Reputation: 93
I have a dataframe with a number of groups, each with a weight and a list of values that can be of arbitrary length, for example:
df = pl.DataFrame(
    {
        "Group": ["Group1", "Group2", "Group3"],
        "Weight": [100.0, 200.0, 300.0],
        "Vals": [[0.5, 0.5, 0.8], [0.5, 0.5, 0.8], [0.7, 0.9]],
    }
)
┌────────┬────────┬─────────────────┐
│ Group ┆ Weight ┆ Vals │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] │
╞════════╪════════╪═════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] │
└────────┴────────┴─────────────────┘
My goal is to calculate a 'Weighted' column, which is the product of each item in the values list and the value in the Weight column:
┌────────┬────────┬─────────────────┬─────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[i64] │
╞════════╪════════╪═════════════════╪═════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50, 50, 80] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100, 100, 160] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210, 270] │
└────────┴────────┴─────────────────┴─────────────────┘
I've tried a few different things:
df.with_columns(
    pl.col("Vals").list.eval(pl.element() * 3).alias("Weight1"),                  # Multiplying with a literal works
    pl.col("Vals").list.eval(pl.element() * pl.col("Weight")).alias("Weight2"),   # Does not work
    pl.col("Vals").list.eval(pl.element() * pl.col("Unknown")).alias("Weight3"),  # Unknown columns give the same result
    pl.col("Vals").list.eval(pl.col("Vals") * pl.col("Weight")).alias("Weight4"), # Same effect
    # pl.col("Vals") * 3 -> gives an error
)
┌────────┬────────┬────────────┬────────────┬──────────────┬──────────────┬────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weight1 ┆ Weight2 ┆ Weight3 ┆ Weight4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪════════════╪════════════╪══════════════╪══════════════╪════════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, ┆ [1.5, 1.5, ┆ [0.25, 0.25, ┆ [0.25, 0.25, ┆ [0.25, 0.25, 0.64] │
│ ┆ ┆ 0.8] ┆ 2.4] ┆ 0.64] ┆ 0.64] ┆ │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, ┆ [1.5, 1.5, ┆ [0.25, 0.25, ┆ [0.25, 0.25, ┆ [0.25, 0.25, 0.64] │
│ ┆ ┆ 0.8] ┆ 2.4] ┆ 0.64] ┆ 0.64] ┆ │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [2.1, 2.7] ┆ [0.49, 0.81] ┆ [0.49, 0.81] ┆ [0.49, 0.81] │
└────────┴────────┴────────────┴────────────┴──────────────┴──────────────┴────────────────────┘
Unless I'm not understanding it correctly, it seems you're unable to access columns outside of the list from within the eval function. Perhaps there is a way to use a list comprehension within the statement, but that doesn't really seem like a neat solution.
What would be the recommended approach here? Any help would be appreciated!
Upvotes: 6
Views: 3917
Reputation: 117400
Updated. Since version 1.10.0, arithmetic operations between lists and scalars are supported:
df.with_columns(
    Weighted = pl.col.Vals * pl.col.Weight
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210.0, 270.0] │
└────────┴────────┴─────────────────┴───────────────────────┘
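The expected output in the question has integer lists rather than floats; if that matters, the result can also be cast, as in this minimal tweak of the snippet above (not part of the original answer):
(
    df.with_columns(
        # cast the list[f64] result to list[i64] to match the desired output
        Weighted = (pl.col.Vals * pl.col.Weight).cast(pl.List(pl.Int64))
    )
)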
Outdated. Since version 1.8.0, arithmetic operations on lists are supported. You still need to convert the scalar column into a list column of the appropriate length with pl.Expr.repeat_by() and pl.Expr.list.len().
df.with_columns(
    Weighted = pl.col.Vals * pl.col.Weight.repeat_by(pl.col.Vals.list.len())
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210.0, 270.0] │
└────────┴────────┴─────────────────┴───────────────────────┘
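For intuition, this is roughly what the intermediate repeat_by() expression produces on its own (an illustrative peek, not part of the original answer):
df.select(pl.col.Weight.repeat_by(pl.col.Vals.list.len()))
# Weight becomes a list column with the same shape as Vals, so the
# element-wise list multiplication lines up row by row:
# [100.0, 100.0, 100.0], [200.0, 200.0, 200.0], [300.0, 300.0]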
Outdated. Unfortunately, operations on lists are not yet supported, but you can work around it using Struct and avoid exploding and imploding the data. You can create a list out of the Weight column and then get the result by multiplying Structs and converting back to List:
- repeat_by() to create a list of Weight values, using the list.len() of the list in the Vals column.
- list.to_struct() to convert each list to a Struct.
- struct.field() to get all the fields of the struct (using "*") as columns.
- concat_list() to concat the columns back into a list.
- list.drop_nulls() to drop nulls which might appear due to the different lengths of the lists.
- cast() to cast the resulting list of floats to a list of integers.
(
    df
    .with_columns(Weighted = pl.col.Weight.repeat_by(pl.col.Vals.list.len()))
    .with_columns(pl.col.Weighted.list.to_struct() * pl.col.Vals.list.to_struct())
    .with_columns(Weighted = pl.concat_list(pl.col.Weighted.struct.field("*")))
    .with_columns(pl.col.Weighted.list.drop_nulls().cast(pl.List(pl.Int64)))
)
┌────────┬────────┬─────────────────┬─────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[i64] │
╞════════╪════════╪═════════════════╪═════════════════╡
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50, 50, 80] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100, 100, 160] │
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210, 270] │
└────────┴────────┴─────────────────┴─────────────────┘
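A small aside on the drop_nulls() step, since it may look surprising: when lists of different lengths are converted to structs, the shorter ones end up with null fields, which then survive the multiplication and the concat_list(). A hypothetical two-row example (not from the answer, assuming the default list.to_struct() behaviour of sizing the struct from the first list):
demo = pl.DataFrame({"a": [[1.0, 2.0, 3.0], [4.0, 5.0]]})
demo.select(pl.col("a").list.to_struct())
# The second row's struct gets a null third field: {4.0, 5.0, null}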
Upvotes: 0
Reputation: 1054
You can solve this in two ways; to be honest, I don't know which one is better, as I haven't tested performance or RAM usage.
The first option is to put both columns in a struct and then apply a custom function to them (this approach is explained in the Polars documentation):
import polars as pl
import numpy as np

def weighted_list(ls, weight):
    return np.array(ls) * weight

df.with_columns(
    pl.struct("Weight", "Vals").map_elements(
        lambda x: weighted_list(x["Vals"], x["Weight"]),
        return_dtype=pl.List(pl.Float64)
    ).alias("Weighted")
)
The second option is to explode, multiply, and re-aggregate. In this case you have to be careful that the combination of your "Group" and "Weight" columns is unique: if you had, for example, two entries with Group3 and Weight 300.0, they would be merged into a single row by the group_by (a way around this is sketched after the code below).
(df.explode("Vals")
.with_columns((pl.col("Vals") * pl.col("Weight")).alias("Weighted"))
.group_by("Group", "Weight")
.agg("Vals", "Weighted")
)
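If the Group/Weight combination might not be unique, one hedged variant is to tag each row with a temporary index and group on that instead. This assumes a Polars version that has with_row_index() (older versions call it with_row_count()):
(df.with_row_index()
   .explode("Vals")
   .with_columns((pl.col("Vals") * pl.col("Weight")).alias("Weighted"))
   # group on the unique row index so duplicate Group/Weight rows stay separate
   .group_by("index", maintain_order=True)
   .agg(pl.col("Group").first(), pl.col("Weight").first(), "Vals", "Weighted")
   .drop("index")
)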
Upvotes: 2
Reputation: 3007
As of the latest version of Polars, this is now the correct syntax:
df = pl.DataFrame(
    {
        "Group": ["Group1", "Group2", "Group3"],
        "Weight": [100.0, 200.0, 300.0],
        "Vals": [[0.5, 0.5, 0.8], [0.5, 0.5, 0.8], [0.7, 0.9]],
    }
)
(df
    .explode('Vals')
    .with_columns(Weighted = pl.col('Weight') * pl.col('Vals'))
    .group_by('Group')
    .agg(
        pl.col('Weight').first(),
        pl.col('Vals'),
        pl.col('Weighted')
    )
)
shape: (3, 4)
┌────────┬────────┬─────────────────┬───────────────────────┐
│ Group ┆ Weight ┆ Vals ┆ Weighted │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[f64] ┆ list[f64] │
╞════════╪════════╪═════════════════╪═══════════════════════╡
│ Group3 ┆ 300.0 ┆ [0.7, 0.9] ┆ [210.0, 270.0] │
│ Group1 ┆ 100.0 ┆ [0.5, 0.5, 0.8] ┆ [50.0, 50.0, 80.0] │
│ Group2 ┆ 200.0 ┆ [0.5, 0.5, 0.8] ┆ [100.0, 100.0, 160.0] │
└────────┴────────┴─────────────────┴───────────────────────┘
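Note that group_by does not guarantee the original row order, which is why Group3 appears first above. If the order matters, one small addition (not in the original answer) is to pass maintain_order=True:
(df
    .explode('Vals')
    .with_columns(Weighted = pl.col('Weight') * pl.col('Vals'))
    # keep groups in the order they first appear, at some performance cost
    .group_by('Group', maintain_order=True)
    .agg(
        pl.col('Weight').first(),
        pl.col('Vals'),
        pl.col('Weighted')
    )
)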
Upvotes: 3