enricw
enricw

Reputation: 293

Deleting entire rows of a dataset for outliers found in a single column

I am currently trying to remove the outlier values from my dataset, using the median absolute deviation method.

To do so, I followed the instructions given by @tanemaki in Detect and exclude outliers in Pandas data frame, which enables the deletion of entire rows that hold at least one outlier value.

In the post I linked, the same question was asked, but was not answered.

The problem is that I only want the outliers to be searched in a single column.

So, for example, my dataframe looks like:


          Temperature    Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        25.42        7.0        
    4        40.31        9.3        
    5        26.21       15.6
    6        26.59       17.9        
 

For example, there are two 'anomalies in the data:

So, what I want is for the outlier function to only 'notice' the anomaly in the Temperature column, and delete its corresponding row.

The outlier code I am using is:

df=pd.read_excel(r'/home/.../myfile.xlsx')
from scipy import stats
df[pd.isnull(df)]=0
dfn=df[(np.abs(stats.zscore(df))<4).all(axis=1)] #@taneski
print(dfn)

And my resulting data frame currently looks like:


          Temperature    Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        25.42        7.0               
    6        26.59       17.9        

In case I am not getting my message across, the desired output would be:


          Temperature    Date       
    1        24.72        2.3        
    2        25.76        4.6        
    3        25.42        7.0  
    5        26.21       15.6         
    6        26.59       17.9        

Any pointers would be of great help. Thanks!

Upvotes: 3

Views: 3110

Answers (2)

Peritract
Peritract

Reputation: 769

At the moment, you're calculating the zscores for the whole dataframe and then filtering the dataframe with those calculated scores; what you want to do is just apply the same idea to one column.

Instead of

dfn=df[(np.abs(stats.zscore(df))<4).all(axis=1)]

You want to have

df[np.abs(stats.zscore(df["Temperature"])) < 4]

As a side note, I found that I was unable to get your example results by comparing the zscores to 4; I had to switch it down to 2.

Upvotes: 3

Mayank Porwal
Mayank Porwal

Reputation: 34086

You can always limit the stats.zscore operation on only the Temperature column instead of the whole df. Like this maybe:

In [573]: dfn = df[(np.abs(stats.zscore(df['Temperature']))<4)]                                                                                                                                             

In [574]: dfn                                                                                                                                                                                               
Out[574]: 
   Temperature  Date
1        24.72   2.3
2        25.76   4.6
3        25.42   7.0
5        26.21  15.6
6        26.59  17.9

Upvotes: 3

Related Questions