kms
kms

Reputation: 2024

Subset / group by pandas Data Frame to calculate mean and apply to missing values

I am trying to subset pandas dataframe by a group / category, calculate a statistic and apply it to the original dataframe for missing values in the group.

df1 = pd.DataFrame({
                    'City': ['SF','NYC','SF','NYC','SF','CHI','LA','LA','CHI'], 
                    'Val': [2,4,0,0,7,4,3,5,6] 
                   })

for name, group in df.groupby(['City']):

    dff = df[df['City'] == name]

    # Calculate mean
    df1 = dff[dff['Val'] != 0]
    mean_val = int(df1['Val'].mean())

Now, I need to apply mean_val to all 0s in the subset.

Upvotes: 1

Views: 312

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35686

We could mask out the 0 values then groupby transform to calculate the mean and fillna to put the means back, lastly convert the column to int using astype:

s = df['Val'].mask(df['Val'].eq(0))
df['Val'] = s.fillna(s.groupby(df['City']).transform('mean')).astype(int)

Or we can boolean index where Val is 0, mask out the 0 values and assign the results of groupby transform back using loc:

m = df['Val'].eq(0)
df.loc[m, 'Val'] = (
    df['Val'].mask(m)
        .groupby(df['City']).transform('mean')
        .astype(int)
)

Both produce: df:

  City  Val
0   SF    2
1  NYC    4
2   SF    4
3  NYC    4
4   SF    7
5  CHI    4
6   LA    3
7   LA    5
8  CHI    6

We could filter dff to get the index locations relative to df and assign back to modify the original approach:

for name, group in df.groupby(['City']):

    dff = df[df['City'] == name]

    # Calculate mean
    df1 = dff[dff['Val'] != 0]
    mean_val = int(df1['Val'].mean())
    # Assign mean back to `df` at index locations where Val is 0 in group
    df.loc[dff[(dff['Val'] == 0)].index, 'Val'] = mean_val

Although looping is highly discouraged in pandas due to increased runtime.

However, if we are going to use the iterable from groupby we should use the values returned instead of filtering from df:

for name, group in df.groupby(['City']):
    # Create Boolean Index
    m = group['Val'] != 0
    # Calculate mean from grouped dataframe `group`
    mean_val = int(group.loc[m, 'Val'].mean())
    # Assign mean back to `df` at index locations where Val is 0 in group
    df.loc[group[~m].index, 'Val'] = mean_val

DataFrame and imports:

import pandas as pd

df = pd.DataFrame({
    'City': ['SF', 'NYC', 'SF', 'NYC', 'SF', 'CHI', 'LA', 'LA', 'CHI'],
    'Val': [2, 4, 0, 0, 7, 4, 3, 5, 6],
})

Upvotes: 2

Related Questions