Reputation: 5467
I have a DataFrame with a column that has some bad data with various negative values. I would like to replace values < 0 with the mean of the group that they are in.
For missing values as NAs, I would do:
data = df.groupby(['GroupID']).column
data.transform(lambda x: x.fillna(x.mean()))
But how to do this operation on a condition like x < 0
?
Thanks!
Upvotes: 14
Views: 25195
Reputation: 1
Easy-to-remember solution:
df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2], [2, -5]], columns=list('ab'))
df
a b
0 1 1
1 1 -1
2 2 1
3 2 2
4 2 -5
Create groups with positive average values:
replace = df[df.b >0].groupby('a')['b'].mean()
replace
a
1 1.0
2 1.5
Replace negative values:
df.apply(lambda x: replace[x.a] if x.b < 0 else x.b, axis=1)
0 1.0
1 1.0
2 1.0
3 2.0
4 1.5
Upvotes: 0
Reputation: 1
Another approach could be:
df.apply(lambda x: df[df['vals'] > 0].groupby('grps')['vals'].mean().loc[x.grps] if x.vals < 0 else x.vals, axis= 1)
Upvotes: 0
Reputation: 1363
I had the same issue and came up with a rather simple solution
func = lambda x : np.where(x < 0, x.mean(), x)
df['Bad_Column'].transform(func)
Note that if you want to return the mean of the correct values (mean based on positive values only) you'd have to specify:
func = lambda x : np.where(x < 0, x.mask(x < 0).mean(), x)
Upvotes: 2
Reputation: 323326
There is a great Example, for your additional question.
df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})
gb = df.groupby('A')
def replace(g):
mask = g < 0
g.loc[mask] = g[~mask].mean()
return g
gb.transform(replace)
Link: http://pandas.pydata.org/pandas-docs/stable/cookbook.html
Upvotes: 1
Reputation: 880299
Using @AndyHayden's example, you could use groupby
/transform
with replace
:
df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=list('ab'))
print(df)
# a b
# 0 1 1
# 1 1 -1
# 2 2 1
# 3 2 2
data = df.groupby(['a'])
def replace(group):
mask = group<0
# Select those values where it is < 0, and replace
# them with the mean of the values which are not < 0.
group[mask] = group[~mask].mean()
return group
print(data.transform(replace))
# b
# 0 1
# 1 1
# 2 1
# 3 2
Upvotes: 13
Reputation: 375685
Here's one way to do it (for the 'b'
column, in this boring example):
In [1]: df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=list('ab'))
In [2]: df
Out[2]:
a b
0 1 1
1 1 -1
2 2 1
3 2 2
Replace those negative values with NaN, and then calculate the mean (b
) in each group:
In [3]: df['b'] = df.b.apply(lambda x: x if x>=0 else pd.np.nan)
In [4]: m = df.groupby('a').mean().b
Then use apply
across each row, to replace each NaN with its groups mean:
In [5]: df['b'] = df.apply(lambda row: m[row['a']]
if pd.isnull(row['b'])
else row['b'],
axis=1)
In [6]: df
Out[6]:
a b
0 1 1
1 1 1
2 2 1
3 2 2
Upvotes: 3