nik

Reputation: 2294

Apply function across multiple columns within group_by in Polars

Given this dataframe:

import polars as pl

polars_df = pl.DataFrame({
    "name": ["A","B","C"],
    "group": ["a","a","b"],
    "val1": [1, None, 3],
    "val2": [1, 5, None],
    "val3": [None, None, 3],
})

I want to calculate the mean and the percentage of NAs within the three val* columns for each group. The result should look like this:

pl.DataFrame([
    {'group': 'a', 'mean': 2.0, 'percentage_na': 0.5},
    {'group': 'b', 'mean': 3.0, 'percentage_na': 0.3333333333333333}
])

In pandas I was able to do this with the following (quite ugly and unoptimized) code:

import pandas as pd

df = polars_df.to_pandas()

pd.concat([
    df.groupby(["group"]).apply(lambda g: g.filter(like="val").mean().mean()).rename("mean"),
    df.groupby(["group"]).apply(lambda g: g.filter(like="val").isna().sum().sum() / (g.filter(like="val").shape[0] * g.filter(like="val").shape[1])).rename("percentage_na")
], axis=1)

Upvotes: 2

Views: 501

Answers (5)

jqurious

Reputation: 21580

Feels like there should be a simpler way - some attempts:

(df.group_by("group")
   .agg(
      pl.mean_horizontal(pl.col(r"^val.+$").mean()).alias("mean"),
      (
         pl.sum_horizontal(pl.col(r"^val.+$").null_count())
         /
         pl.sum_horizontal(pl.col(r"^val.+$").len())
      )
      .alias("percentage_na")
   )
)
shape: (2, 3)
┌───────┬──────┬───────────────┐
│ group ┆ mean ┆ percentage_na │
│ ---   ┆ ---  ┆ ---           │
│ str   ┆ f64  ┆ f64           │
╞═══════╪══════╪═══════════════╡
│ a     ┆ 2.0  ┆ 0.5           │
│ b     ┆ 3.0  ┆ 0.333333      │
└───────┴──────┴───────────────┘

Upvotes: 3

Luca

Reputation: 1914

Here is my proposal, strongly inspired by @jqurious's first answer.

combined_lists = pl.concat_list(r'^val.+$').over('group', mapping_strategy='join')

df.select(
    'group',
    mean = 
        combined_lists.list.eval(pl.element().list.explode()).list.mean(),
    percentage_na = 
        combined_lists.list.eval(pl.element().list.explode().null_count() 
        / pl.element().list.explode().count()).flatten()
).unique(subset = 'group')

shape: (2, 3)
┌───────┬──────────┬───────────────┐
│ group ┆ mean     ┆ percentage_na │
│ ---   ┆ ---      ┆ ---           │
│ str   ┆ f64      ┆ f64           │
╞═══════╪══════════╪═══════════════╡
│ a     ┆ 2.333333 ┆ 0.5           │
│ b     ┆ 3.0      ┆ 0.333333      │
└───────┴──────────┴───────────────┘

As @Dean MacGregor mentioned, I also think the average of group a should be 2.33 rather than 2.0.
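
Just to spell out where the two numbers come from, using the data from the question (plain Python, nothing Polars-specific):

# Group "a" spans rows A and B; their val1/val2/val3 values flattened:
vals = [1, None, 1, 5, None, None]

non_null = [v for v in vals if v is not None]   # [1, 1, 5]
print(sum(non_null) / len(non_null))            # 2.333... -> mean over all values

# The 2.0 in the question's expected output is the mean of the per-column
# means instead: mean(val1)=1.0, mean(val2)=3.0, mean(val3)=null
print((1.0 + 3.0) / 2)                          # 2.0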

Upvotes: 1

nik

Reputation: 2294

With inspiration from Dean MacGregor, and taking the correction that I accidentally calculated the mean of the means in my pandas code (it should simply be the mean over the whole group), I finally came up with this solution:

all_cols_except_val = [x for x in df.columns if "val" not in x]

df.unpivot(index=all_cols_except_val).group_by("group").agg(
    pl.col("value").mean().alias("mean"),
    # len() counts all values including nulls, so this is the share of nulls
    (pl.col("value").is_null().sum() / pl.col("value").len()).alias("percent_na"),
)
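
On the example data this should give the same numbers as in Dean MacGregor's answer below (note that group_by does not guarantee row order):

shape: (2, 3)
┌───────┬──────────┬────────────┐
│ group ┆ mean     ┆ percent_na │
│ ---   ┆ ---      ┆ ---        │
│ str   ┆ f64      ┆ f64        │
╞═══════╪══════════╪════════════╡
│ a     ┆ 2.333333 ┆ 0.5        │
│ b     ┆ 3.0      ┆ 0.333333   │
└───────┴──────────┴────────────┘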

Thanks everyone :)

Upvotes: 1

ignoring_gravity

Reputation: 10563

You could use unpivot and concat:

pl.concat(
    [
        polars_df.group_by("group")
        .agg(pl.exclude("name").mean())
        .unpivot(index="group")
        .group_by("group")
        .agg(pl.col("value").mean())
        .rename({"value": "mean"}),
        polars_df.group_by("group")
        .agg(pl.exclude("name").is_null().mean())
        .unpivot(index="group")
        .group_by("group")
        .agg(pl.col("value").mean())
        .drop("group")
        .rename({"value": "percentage_na"}),
    ],
    how="horizontal",
)
shape: (2, 3)
┌───────┬──────┬───────────────┐
│ group ┆ mean ┆ percentage_na │
│ ---   ┆ ---  ┆ ---           │
│ str   ┆ f64  ┆ f64           │
╞═══════╪══════╪═══════════════╡
│ a     ┆ 2.0  ┆ 0.5           │
│ b     ┆ 3.0  ┆ 0.333333      │
└───────┴──────┴───────────────┘

Not the simplest; still seeing if there's a simpler way.

Upvotes: 1

Dean MacGregor

Reputation: 18691

I rolled my answer back to the version where the answer is 2.33:

all_cols_except_val = [x for x in df.columns if "val" not in x]

(
    df.unpivot(index=all_cols_except_val)
    .group_by("group")
    .agg(
        mean=pl.col("value").mean(),
        # len() counts nulls as well, giving nulls / total values
        percent_na=pl.col("value").is_null().sum() / pl.col("value").len(),
    )
)


shape: (2, 3)
┌───────┬──────────┬────────────┐
│ group ┆ mean     ┆ percent_na │
│ ---   ┆ ---      ┆ ---        │
│ str   ┆ f64      ┆ f64        │
╞═══════╪══════════╪════════════╡
│ b     ┆ 3.0      ┆ 0.333333   │
│ a     ┆ 2.333333 ┆ 0.5        │
└───────┴──────────┴────────────┘

Upvotes: 2
