Reputation: 10051
Say I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=6, freq='M'),
'value': [3, 3.5, -5, 2, 7, 6.8], 'type': ['a', 'a', 'a', 'b', 'b', 'b']})
df['pct'] = df.groupby(['type'])['value'].pct_change()
Output:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -2.428571
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b -0.028571
I want to replace the pct values that are greater than 0.2 or smaller than -0.2 with the mean of their type group.
My attempt: first replace the "outliers" with an extreme value, -999, then replace that value with the group mean. This is what I have done:
df.loc[df['pct'] >= 0.2, 'pct'] = -999
df.loc[df['pct'] <= -0.2, 'pct'] = -999
df["pct"] = df.groupby(['type'])['pct'].transform(lambda x: x.replace(-999, x.mean()))
But this is clearly not the best way to solve the problem, and the results are not correct:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -499.416667
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b -499.514286
5 2013-06-30 6.8 b -0.028571
The expected result should look like this:
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a 0.166667
2 2013-03-31 -5.0 a -1.130
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b 1.24
What have I done wrong? Thanks again for your kind help.
Upvotes: 1
Views: 289
Reputation: 863226
Instead of your two conditions, you can use Series.between to build a mask and then set the values in pct from GroupBy.transform with mean:
mask = df['pct'].between(-0.2, 0.2)
# assign the Series itself so pandas aligns it to the masked rows by index
df.loc[mask, 'pct'] = df.groupby('type')['pct'].transform('mean')
print(df)
date value type pct
0 2013-01-31 3.0 a NaN
1 2013-02-28 3.5 a -1.130952
2 2013-03-31 -5.0 a -2.428571
3 2013-04-30 2.0 b NaN
4 2013-05-31 7.0 b 2.500000
5 2013-06-30 6.8 b 1.235714
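Note that between(-0.2, 0.2) is True for values inside the band, so the code above overwrites the in-range rows, as the printed output shows. If the goal is the one stated in the question (replace values greater than 0.2 or smaller than -0.2), a minimal sketch with the opposite mask could look like this. It assumes the group mean should still include the outliers, and it leaves out the date column because it plays no role; the names group_mean and outside are only illustrative.

```python
import pandas as pd

# Same values as in the question; the date column is omitted on purpose.
df = pd.DataFrame({
    'value': [3, 3.5, -5, 2, 7, 6.8],
    'type': ['a', 'a', 'a', 'b', 'b', 'b'],
})
df['pct'] = df.groupby('type')['value'].pct_change()

# Per-group mean computed from the untouched pct values (NaN is skipped).
group_mean = df.groupby('type')['pct'].transform('mean')

# True only where |pct| > 0.2. NaN compares as False, so the first row
# of each group stays NaN instead of being filled with the mean.
outside = df['pct'].abs() > 0.2

# Assigning a Series lets pandas align it to the masked rows by index.
df.loc[outside, 'pct'] = group_mean
print(df)
```

With this mask, rows 2 and 4 are the ones that get replaced, while rows 1 and 5 keep their original pct.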
An alternative solution is to use numpy.where:
import numpy as np
mask = df['pct'].between(-0.2, 0.2)
df['pct'] = np.where(mask, df.groupby('type')['pct'].transform('mean'), df['pct'])
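As for why the attempt in the question ends up near -499: the -999 placeholder is written into pct before the mean is taken, so x.mean() inside the lambda averages the placeholder together with the real values. A small sketch on the group a values from the question (the variable names are only illustrative) shows the difference:

```python
import pandas as pd

# pct values of group 'a' from the question
pct_a = pd.Series([float('nan'), 0.166667, -2.428571])

# The question's order of operations: overwrite the outlier first, then average.
with_sentinel = pct_a.copy()
with_sentinel[with_sentinel.abs() >= 0.2] = -999
contaminated_mean = with_sentinel.mean()  # (0.166667 - 999) / 2

# Averaging before anything is overwritten gives the intended group mean.
clean_mean = pct_a.mean()  # (0.166667 - 2.428571) / 2

print(contaminated_mean, clean_mean)
```

Computing the group mean first and only then masking, as in the solutions above, avoids the placeholder altogether.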
Upvotes: 2