AlphaX

Reputation: 67

ValueError when trying to remove outlier in pandas

I have a dataset where I need to remove some huge outliers (10x the regular data) but I can't figure out a smart way to do it. I tried

if df['pickup_latitude'] >= 3*df['pickup_latitude'].mean():
   df['pickup_latitude'] = df['pickup_latitude'].mean()

But that gives me: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I have also tried other methods, such as:

df[np.abs(df.Data-df.Data.mean()) <= (3*df.Data.std())]

but they don't work because I have timestamps on my data which break the other solutions.

Any smart way to filter the outliers away or replace them with other values?

Upvotes: 3

Views: 218

Answers (1)

AChervony

Reputation: 663

TL;DR

You need to provide a Boolean vector (a mask) that identifies the data frame cells you want to reassign. In your case, that means replacing the outliers and the erroneous values with the mean (imputation).
I would do it in several steps:

df = pd.DataFrame([0,1,3,'blah',4,5,'blah'], columns = ['pickup_latitude'])
# Identify the numeric observations
numeric = df['pickup_latitude'].astype(str).str.isdigit()
# Calculate mean
mean = df.loc[numeric,'pickup_latitude'].mean()
# Impute non numeric values
df.loc[~numeric,'pickup_latitude'] = mean
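# Impute outliers (for this demo every value >= the mean counts as an outlier;
# on real data use a stricter threshold, such as the 3 * mean from the question)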
df.loc[df['pickup_latitude'] >= mean, 'pickup_latitude'] = mean


df['pickup_latitude']
Out[81]: 
0      0
1      1
2    2.6
3    2.6
4    2.6
5    2.6
6    2.6
Name: pickup_latitude, dtype: object

I would also look closely at cleaning the data.
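
One caveat with the str.isdigit() check above: it returns False for strings that contain a minus sign or a decimal point, so real coordinates such as 40.75 or -73.98 would be treated as non-numeric. A sketch of an alternative, assuming that pd.to_numeric with errors='coerce' suits your data (it turns anything it cannot parse into NaN):

```python
import pandas as pd

df = pd.DataFrame([0, 1, 3, 'blah', 4, 5, 'blah'], columns=['pickup_latitude'])

# Anything that cannot be parsed as a number becomes NaN
values = pd.to_numeric(df['pickup_latitude'], errors='coerce')

# mean() skips NaN by default, so this is the mean of the valid entries only
mean = values.mean()

# Fill the unparseable entries with the mean; the column becomes float64
df['pickup_latitude'] = values.fillna(mean)
```

Unlike the object-dtype result above, the column ends up as float64, so later numeric comparisons behave as expected.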


Intuitive explanation:

I don't think the imputation fails because of a data integrity issue such as timestamps mixed into numeric data: I was able to replicate the first error you described with purely numeric data.

You cannot do this:

import pandas as pd
df = pd.DataFrame([0,1,3,4,5], columns = ['pickup_latitude'])
if df['pickup_latitude'] >= df['pickup_latitude'].mean():
   df['pickup_latitude'] = df['pickup_latitude'].mean()

Comparing a Series with a constant produces a Boolean Series, one value per row, and an if statement cannot reduce that to a single True or False:

df['pickup_latitude']
Out[12]: 
0    0
1    1
2    3
3    4
4    5
Name: pickup_latitude, dtype: int64

df['pickup_latitude'].mean()
Out[13]: 2.6

if df['pickup_latitude'] >= df['pickup_latitude'].mean():
   df['pickup_latitude'] = df['pickup_latitude'].mean()


Traceback (most recent call last):

  File "<ipython-input-15-1135c8386dd6>", line 1, in <module>
    if df['pickup_latitude'] >= df['pickup_latitude'].mean():

  File "C:\Users\____\AppData\Local\Continuum\anaconda3\envs\DS\lib\site-packages\pandas\core\generic.py", line 1121, in __nonzero__
    .format(self.__class__.__name__))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
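
The comparison itself is valid; it is the if statement that fails, because it needs a single truth value. A minimal sketch of the mask-based alternative, applying the 3x-mean rule from the question to made-up numbers (the data values are an assumption for illustration only):

```python
import pandas as pd

# Made-up data: the last value is roughly 10x the others
df = pd.DataFrame({'pickup_latitude': [40.0, 41.0, 39.0, 40.0, 400.0]})

mean = df['pickup_latitude'].mean()
is_outlier = df['pickup_latitude'] >= 3 * mean  # Boolean Series, one entry per row

# Option 1: replace the flagged cells with the mean
imputed = df.copy()
imputed.loc[is_outlier, 'pickup_latitude'] = mean

# Option 2: drop the flagged rows instead
filtered = df[~is_outlier]
```

Note that the mean used here is itself pulled up by the outlier; a median is less sensitive to extreme values if that matters for your data.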

The second error is peculiar to your data. I would investigate why different data types reside in the same column (numeric and timestamp).
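
To see what is actually stored in the column, one option is to count the Python types of its cells. A small sketch on a made-up column (the values are assumptions, not your data):

```python
import pandas as pd

# Made-up column that mixes numbers with a timestamp
s = pd.Series(
    [40.7, 40.8, pd.Timestamp('2015-01-01 10:00:00'), 40.6],
    name='pickup_latitude',
)

# Count how many cells hold each Python type
type_counts = s.map(type).value_counts()
print(type_counts)

# Boolean mask of the cells that are plain numbers
is_number = s.map(lambda v: isinstance(v, (int, float)))

# Show the rows that are not numbers
print(s[~is_number])
```

Once the non-numeric rows are identified, they can be dropped or moved to their own column before applying any outlier rule.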

Upvotes: 2
