Deleting entire rows of a dataset for outliers found in a single column

Question

I am currently trying to remove the outlier values from my dataset, using the median absolute deviation method.

To do so, I followed the instructions given by @tanemaki in Detect and exclude outliers in Pandas data frame, which enables the deletion of entire rows that hold at least one outlier value.

In the post I linked, the same question was asked, but was not answered.

The problem is that I only want the outliers to be searched in a single column.

So, for example, my dataframe looks like:


          Temperature    Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        25.42        7.0        
    4        40.31        9.3        
    5        26.21       15.6
    6        26.59       17.9

For example, there are two 'anomalies in the data:

The Temperature value in row [4]
The Date value in row [5]

So, what I want is for the outlier function to only 'notice' the anomaly in the Temperature column, and delete its corresponding row.

The outlier code I am using is:

df=pd.read_excel(r'/home/.../myfile.xlsx')
from scipy import stats
df[pd.isnull(df)]=0
dfn=df[(np.abs(stats.zscore(df))<4).all(axis=1)] #@taneski
print(dfn)

And my resulting data frame currently looks like:


          Temperature    Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        25.42        7.0               
    6        26.59       17.9

In case I am not getting my message across, the desired output would be:


          Temperature    Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        25.42        7.0  
    5        26.21       15.6         
    6        26.59       17.9

Any pointers would be of great help. Thanks!

Mayank Porwal · Accepted Answer

You can always limit the stats.zscore operation on only the Temperature column instead of the whole df. Like this maybe:

In [573]: dfn = df[(np.abs(stats.zscore(df['Temperature']))<4)]                                                                                                                                             

In [574]: dfn                                                                                                                                                                                               
Out[574]: 
   Temperature  Date
1        24.72   2.3
2        25.76   4.6
3        25.42   7.0
5        26.21  15.6
6        26.59  17.9

Deleting entire rows of a dataset for outliers found in a single column

Answers (2)

Related Questions