Reputation: 2024
I am trying to subset pandas dataframe by a group / category, calculate a statistic and apply it to the original dataframe for missing values in the group.
df1 = pd.DataFrame({
'City': ['SF','NYC','SF','NYC','SF','CHI','LA','LA','CHI'],
'Val': [2,4,0,0,7,4,3,5,6]
})
for name, group in df.groupby(['City']):
dff = df[df['City'] == name]
# Calculate mean
df1 = dff[dff['Val'] != 0]
mean_val = int(df1['Val'].mean())
Now, I need to apply mean_val
to all 0s
in the subset.
Upvotes: 1
Views: 312
Reputation: 35686
We could mask
out the 0 values then groupby transform
to calculate the mean and fillna
to put the means back, lastly convert the column to int
using astype
:
s = df['Val'].mask(df['Val'].eq(0))
df['Val'] = s.fillna(s.groupby(df['City']).transform('mean')).astype(int)
Or we can boolean index where Val is 0, mask
out the 0 values and assign the results of groupby transform
back using loc
:
m = df['Val'].eq(0)
df.loc[m, 'Val'] = (
df['Val'].mask(m)
.groupby(df['City']).transform('mean')
.astype(int)
)
Both produce:
df
:
City Val
0 SF 2
1 NYC 4
2 SF 4
3 NYC 4
4 SF 7
5 CHI 4
6 LA 3
7 LA 5
8 CHI 6
We could filter dff
to get the index locations relative to df
and assign back to modify the original approach:
for name, group in df.groupby(['City']):
dff = df[df['City'] == name]
# Calculate mean
df1 = dff[dff['Val'] != 0]
mean_val = int(df1['Val'].mean())
# Assign mean back to `df` at index locations where Val is 0 in group
df.loc[dff[(dff['Val'] == 0)].index, 'Val'] = mean_val
Although looping is highly discouraged in pandas due to increased runtime.
However, if we are going to use the iterable from groupby
we should use the values returned instead of filtering from df
:
for name, group in df.groupby(['City']):
# Create Boolean Index
m = group['Val'] != 0
# Calculate mean from grouped dataframe `group`
mean_val = int(group.loc[m, 'Val'].mean())
# Assign mean back to `df` at index locations where Val is 0 in group
df.loc[group[~m].index, 'Val'] = mean_val
DataFrame and imports:
import pandas as pd
df = pd.DataFrame({
'City': ['SF', 'NYC', 'SF', 'NYC', 'SF', 'CHI', 'LA', 'LA', 'CHI'],
'Val': [2, 4, 0, 0, 7, 4, 3, 5, 6],
})
Upvotes: 2