Reputation: 2294
Given this dataframe:
import polars as pl

polars_df = pl.DataFrame({
    "name": ["A", "B", "C"],
    "group": ["a", "a", "b"],
    "val1": [1, None, 3],
    "val2": [1, 5, None],
    "val3": [None, None, 3],
})
I want to calculate the mean and the percentage of NAs within the three val* columns for each group. So the result should look like:
pl.DataFrame([
    {'group': 'a', 'mean': 2.0, 'percentage_na': 0.5},
    {'group': 'b', 'mean': 3.0, 'percentage_na': 0.3333333333333333},
])
In Pandas I was able to do this with this (quite ugly and not optimized) code:
import pandas as pd

df = polars_df.to_pandas()
pd.concat(
    [
        df.groupby(["group"])
        .apply(lambda g: g.filter(like="val").mean().mean())
        .rename("mean"),
        df.groupby(["group"])
        .apply(
            lambda g: g.filter(like="val").isna().sum().sum()
            / (g.filter(like="val").shape[0] * g.filter(like="val").shape[1])
        )
        .rename("percentage_na"),
    ],
    axis=1,
)
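For context on the expected numbers: the pandas snippet above takes the mean of the per-column means, which is where group a's 2.0 comes from; the mean over all raw values in the group would be 2.333 instead (a point the answers below pick up on). A plain-Python check of both interpretations, using only the sample data from the question:

```python
# group "a" spans rows A and B; flatten its val* cells
cells_a = [1, None, 1, 5, None, None]  # val1, val2, val3 for rows A and B

# mean over all non-null raw values in the group
non_null = [v for v in cells_a if v is not None]
flat_mean = sum(non_null) / len(non_null)           # 7 / 3 = 2.333...

# mean of the per-column means (what the pandas code computes)
col_means = []
for col in ([1, None], [1, 5], [None, None]):       # val1, val2, val3
    vals = [v for v in col if v is not None]
    if vals:                                        # the all-null column drops out
        col_means.append(sum(vals) / len(vals))
mean_of_means = sum(col_means) / len(col_means)     # (1.0 + 3.0) / 2 = 2.0

# the NA percentage is the same under either reading
percentage_na = cells_a.count(None) / len(cells_a)  # 3 / 6 = 0.5
```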
Upvotes: 2
Views: 501
Reputation: 21580
Feels like there should be a simpler way - some attempts:
(df.group_by("group")
   .agg(
       pl.mean_horizontal(pl.col(r"^val.+$").mean()).alias("mean"),
       (
           pl.sum_horizontal(pl.col(r"^val.+$").null_count())
           / pl.sum_horizontal(pl.col(r"^val.+$").len())
       )
       .alias("percentage_na")
   )
)
shape: (2, 3)
┌───────┬──────┬───────────────┐
│ group ┆ mean ┆ percentage_na │
│ ---   ┆ ---  ┆ ---           │
│ str   ┆ f64  ┆ f64           │
╞═══════╪══════╪═══════════════╡
│ a     ┆ 2.0  ┆ 0.5           │
│ b     ┆ 3.0  ┆ 0.333333      │
└───────┴──────┴───────────────┘
Upvotes: 3
Reputation: 1914
Here is my proposal, strongly inspired by @jqurious's first answer.
combined_lists = pl.concat_list(r'^val.+$').over('group', mapping_strategy='join')

df.select(
    'group',
    mean=combined_lists.list.eval(pl.element().list.explode()).list.mean(),
    percentage_na=combined_lists.list.eval(
        pl.element().list.explode().null_count()
        # len() includes nulls; Expr.count() only counts non-null values in recent Polars
        / pl.element().list.explode().len()
    ).flatten(),
).unique(subset='group')
shape: (2, 3)
┌───────┬──────────┬───────────────┐
│ group ┆ mean ┆ percentage_na │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════╪══════════╪═══════════════╡
│ a ┆ 2.333333 ┆ 0.5 │
│ b ┆ 3.0 ┆ 0.333333 │
└───────┴──────────┴───────────────┘
As @Dean MacGregor mentioned, I also understand that the average of group a should be 2.33 rather than 2.0.
Upvotes: 1
Reputation: 2294
With inspiration from Dean MacGregor, and taking your correction that my pandas code accidentally calculated the mean of the per-column means (it should simply be the mean over the whole group), I finally came up with this solution:
all_cols_except_val = [x for x in df.columns if "val" not in x]

df.unpivot(index=all_cols_except_val).group_by("group").agg(
    pl.col("value").mean().alias("mean"),
    # len() includes nulls; Expr.count() only counts non-null values in recent Polars
    (pl.col("value").is_null().sum() / pl.col("value").len()).alias("percent_na"),
)
Thanks everyone :)
Upvotes: 1
Reputation: 10563
You could use unpivot and concat:
pl.concat(
    [
        polars_df.group_by("group")
        .agg(pl.exclude("name").mean())
        .unpivot(index="group")
        .group_by("group")
        .agg(pl.col("value").mean())
        .rename({"value": "mean"}),
        polars_df.group_by("group")
        .agg(pl.exclude("name").is_null().mean())
        .unpivot(index="group")
        .group_by("group")
        .agg(pl.col("value").mean())
        .drop("group")
        .rename({"value": "percentage_na"}),
    ],
    how="horizontal",
)
shape: (2, 3)
┌───────┬──────┬───────────────┐
│ group ┆ mean ┆ percentage_na │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════╪══════╪═══════════════╡
│ a ┆ 2.0 ┆ 0.5 │
│ b ┆ 3.0 ┆ 0.333333 │
└───────┴──────┴───────────────┘
Not the simplest; I'll see if there's a simpler way.
Upvotes: 1
Reputation: 18691
I rolled back my answer to the version where group a's mean is 2.33:
all_cols_except_val = [x for x in df.columns if "val" not in x]

df.unpivot(index=all_cols_except_val) \
    .group_by("group") \
    .agg(
        mean=pl.col("value").mean(),
        # len() includes nulls; Expr.count() only counts non-null values in recent Polars
        percent_na=pl.col("value").is_null().sum() / pl.col("value").len(),
    )
shape: (2, 3)
┌───────┬──────────┬────────────┐
│ group ┆ mean ┆ percent_na │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═══════╪══════════╪════════════╡
│ b ┆ 3.0 ┆ 0.333333 │
│ a ┆ 2.333333 ┆ 0.5 │
└───────┴──────────┴────────────┘
Upvotes: 2