Reputation: 83
I need to add 2 columns to a Pandas data frame populated with conditional mean and standard deviation.
# Import pandas library
import pandas as pd
# Initialize list of lists
data = [
['Tom', 'Blue', 'Small', 10, ],
['Tom', 'Blue', 'Big', 15, ],
['Tom', 'Red', 'Big', 3, ],
['Tom', 'Red', 'Small', 8, ],
['Tom', 'Red', 'Medium', 5, ],
['Tom', 'Blue', 'Medium', 12, ],
['Nick', 'Blue', 'Small', 15, ],
['Nick', 'Red', 'Medium', 11, ],
['Nick', 'Red', 'Small', 17, ],
['Julie', 'Blue', 'Big', 14, ],
['Julie', 'Red', 'Big', 11, ],
['Julie', 'Red', 'Medium', 3, ],
['Julie', 'Red', 'Medium', 4, ],
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Colour', 'Size', 'Age'])
# print dataframe.
df
In this example, I need to add 2 columns:
df['mean']
and df['std']
conditional to the name and colour only.
I looked at .mean()
and .std()
documentation but didn't find a way to add a set of conditions. I believe I might need 2 functions that, for each row, takes the name and the colour as arguments, calculate the mean or std over the whole series, and then populate the new columns.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html
I believe I need a function that dynamically checks the name and the colour for each row, because, unlike this example data frame, my application is for a data frame that has thousands and names and thousands of colours.
I know how to apply .mean
or .std
to a whole pandas series, without conditions:
df['Age'].std()
or
df['Age'].mean()
I also found that a group by + transformation does something slightly similar without the capacity to dynamically check for each row like a function could.
df['mean'] = df.groupby(['Name','Colour']).transform('mean')
Thank you for your help.
Here is the output I expected. Thanks for the answers: Output dataframe
Upvotes: 4
Views: 4361
Reputation: 150755
Since transform
does not support std
, we can go around by groupby().agg
and merge
:
new_df = df.groupby(['Name', 'Colour']).Age.agg({'mean','std'})
df = df.merge(new_df, left_on=['Name', 'Colour'], right_index=True)
Output:
Name Colour Size Age std mean
0 Tom Blue Small 10 2.516611 12.333333
1 Tom Blue Big 15 2.516611 12.333333
5 Tom Blue Medium 12 2.516611 12.333333
2 Tom Red Big 3 2.516611 5.333333
3 Tom Red Small 8 2.516611 5.333333
4 Tom Red Medium 5 2.516611 5.333333
6 Nick Blue Small 15 NaN 15.000000
7 Nick Red Medium 11 4.242641 14.000000
8 Nick Red Small 17 4.242641 14.000000
9 Julie Blue Big 14 NaN 14.000000
10 Julie Red Big 11 4.358899 6.000000
11 Julie Red Medium 3 4.358899 6.000000
12 Julie Red Medium 4 4.358899 6.000000
Upvotes: 4