Alphynn
Alphynn

Reputation: 83

Adding a new pandas dataframe columns populating it with conditional calculations (mean if, standard deviation if)

I need to add 2 columns to a Pandas data frame populated with conditional mean and standard deviation.

# Import pandas library 
import pandas as pd 

# Initialize list of lists 
data = [
    ['Tom',   'Blue', 'Small',  10, ], 
    ['Tom',   'Blue', 'Big',    15, ], 
    ['Tom',   'Red',  'Big',     3, ], 
    ['Tom',   'Red',  'Small',   8, ], 
    ['Tom',   'Red',  'Medium',  5, ], 
    ['Tom',   'Blue', 'Medium', 12, ], 
    ['Nick',  'Blue', 'Small',  15, ], 
    ['Nick',  'Red',  'Medium', 11, ], 
    ['Nick',  'Red',  'Small',  17, ], 
    ['Julie', 'Blue', 'Big',    14, ], 
    ['Julie', 'Red',  'Big',    11, ],
    ['Julie', 'Red',  'Medium',  3, ], 
    ['Julie', 'Red',  'Medium',  4, ],
]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Name', 'Colour', 'Size', 'Age']) 

# print dataframe. 
df

In this example, I need to add 2 columns: df['mean'] and df['std'] conditional to the name and colour only.

I looked at .mean() and .std() documentation but didn't find a way to add a set of conditions. I believe I might need 2 functions that, for each row, takes the name and the colour as arguments, calculate the mean or std over the whole series, and then populate the new columns.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html

I believe I need a function that dynamically checks the name and the colour for each row, because, unlike this example data frame, my application is for a data frame that has thousands and names and thousands of colours.

I know how to apply .mean or .std to a whole pandas series, without conditions:

df['Age'].std()

or

df['Age'].mean()

I also found that a group by + transformation does something slightly similar without the capacity to dynamically check for each row like a function could.

df['mean'] = df.groupby(['Name','Colour']).transform('mean')

Thank you for your help.

Here is the output I expected. Thanks for the answers: Output dataframe

Upvotes: 4

Views: 4361

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150755

Since transform does not support std, we can go around by groupby().agg and merge:

new_df = df.groupby(['Name', 'Colour']).Age.agg({'mean','std'})
df = df.merge(new_df, left_on=['Name', 'Colour'], right_index=True)

Output:

     Name Colour    Size  Age       std       mean
0     Tom   Blue   Small   10  2.516611  12.333333
1     Tom   Blue     Big   15  2.516611  12.333333
5     Tom   Blue  Medium   12  2.516611  12.333333
2     Tom    Red     Big    3  2.516611   5.333333
3     Tom    Red   Small    8  2.516611   5.333333
4     Tom    Red  Medium    5  2.516611   5.333333
6    Nick   Blue   Small   15       NaN  15.000000
7    Nick    Red  Medium   11  4.242641  14.000000
8    Nick    Red   Small   17  4.242641  14.000000
9   Julie   Blue     Big   14       NaN  14.000000
10  Julie    Red     Big   11  4.358899   6.000000
11  Julie    Red  Medium    3  4.358899   6.000000
12  Julie    Red  Medium    4  4.358899   6.000000

Upvotes: 4

Related Questions