Reputation: 293
I am currently trying to remove the outlier values from my dataset, using the median absolute deviation method.
To do so, I followed the instructions given by @tanemaki in Detect and exclude outliers in Pandas data frame, which enables the deletion of entire rows that hold at least one outlier value.
In the post I linked, the same question was asked, but was not answered.
The problem is that I only want the outliers to be searched in a single column.
So, for example, my dataframe looks like:
Temperature Date
1 24.72 2.3
2 25.76 4.6
3 25.42 7.0
4 40.31 9.3
5 26.21 15.6
6 26.59 17.9
For example, there are two 'anomalies in the data:
So, what I want is for the outlier function to only 'notice' the anomaly in the Temperature column, and delete its corresponding row.
The outlier code I am using is:
df=pd.read_excel(r'/home/.../myfile.xlsx')
from scipy import stats
df[pd.isnull(df)]=0
dfn=df[(np.abs(stats.zscore(df))<4).all(axis=1)] #@taneski
print(dfn)
And my resulting data frame currently looks like:
Temperature Date
1 24.72 2.3
2 25.76 4.6
3 25.42 7.0
6 26.59 17.9
In case I am not getting my message across, the desired output would be:
Temperature Date
1 24.72 2.3
2 25.76 4.6
3 25.42 7.0
5 26.21 15.6
6 26.59 17.9
Any pointers would be of great help. Thanks!
Upvotes: 3
Views: 3110
Reputation: 769
At the moment, you're calculating the zscores for the whole dataframe and then filtering the dataframe with those calculated scores; what you want to do is just apply the same idea to one column.
Instead of
dfn=df[(np.abs(stats.zscore(df))<4).all(axis=1)]
You want to have
df[np.abs(stats.zscore(df["Temperature"])) < 4]
As a side note, I found that I was unable to get your example results by comparing the zscores to 4; I had to switch it down to 2.
Upvotes: 3
Reputation: 34086
You can always limit the stats.zscore
operation on only the Temperature
column instead of the whole df
. Like this maybe:
In [573]: dfn = df[(np.abs(stats.zscore(df['Temperature']))<4)]
In [574]: dfn
Out[574]:
Temperature Date
1 24.72 2.3
2 25.76 4.6
3 25.42 7.0
5 26.21 15.6
6 26.59 17.9
Upvotes: 3