Reputation: 1445
I have a script that generates a pandas data frame with a varying number of value columns. As an example, this df might be
import pandas as pd
df = pd.DataFrame({
'group': ['A', 'A', 'A', 'B', 'B'],
'group_color' : ['green', 'green', 'green', 'blue', 'blue'],
'val1': [5, 2, 3, 4, 5],
'val2' : [4, 2, 8, 5, 7]
})
group group_color val1 val2
0 A green 5 4
1 A green 2 2
2 A green 3 8
3 B blue 4 5
4 B blue 5 7
My goal is to get the grouped mean for each of the value columns. In this specific case (with 2 value columns), I can use
df.groupby('group').agg({"group_color": "first", "val1": "mean", "val2": "mean"})
group_color val1 val2
group
A green 3.333333 4.666667
B blue 4.500000 6.000000
but that does not work when the data frame in question has more value columns (val3, val4 etc.). Is there a way to dynamically take the mean of "all the other columns" or "all columns containing val in their names"?
Upvotes: 25
Views: 5497
Reputation: 294576
Per OP's comment
We can group by both 'group'
and 'group_color'
without the risk of there being more than one unique 'group_color'
per 'group'
Consequently:
df.groupby(['group', 'group_color']).mean().reset_index(level=1)
group_color val1 val2
group
A green 3.333333 4.666667
B blue 4.500000 6.000000
Upvotes: 4
Reputation: 27899
You can go with 2 dictionaries that you can combine like this:
df.groupby('group').agg({**{'group_color': 'first'}, **{c: 'mean' for c in df.columns if c.startswith('val')}})
In this case you have one dict
with fixed aggregations and other with dynamic column selection.
Upvotes: 2
Reputation: 88305
Unfortunately you will have to apply both aggregation functions separately (that or repeat "valn": "mean"
as many times as valx
columns). Groupby.agg
can take a dictionary but the keys must be individual columns.
The way I'd do this is using DataFrame.filter
to select the subset of the dataframe with the columns following the format of valx
, aggregate with the mean, and then assign new columns with the aggregated results on the other columns:
(df.filter(regex=r'^val').groupby(df.group).mean()
.assign(color = df.group_color.groupby(df.group).first()))
val1 val2 color
group
A 3.333333 4.666667 green
B 4.500000 6.000000 blue
Upvotes: 6
Reputation: 323396
More easy like
df.groupby('group').agg(lambda x : x.head(1) if x.dtype=='object' else x.mean())
Out[63]:
group_color val1 val2
group
A green 3.333333 4.666667
B blue 4.500000 6.000000
Upvotes: 16
Reputation: 150825
If your group_color
is always the same within one group, you can do:
df.pivot_table(index=['group','group_color'],aggfunc='mean')
Output:
val1 val2
group group_color
A green 3.333333 4.666667
B blue 4.500000 6.000000
In the other case, you can build the dictionary and pass it to agg
:
agg_dict = {f: 'first' if f=='group_color' else 'mean' for f in df.columns[1:]}
df.groupby('group').agg(agg_dict)
Which output:
group_color val1 val2
group
A green 3.333333 4.666667
B blue 4.500000 6.000000
Upvotes: 9