DreamerP

Reputation: 198

Removing outliers from pandas data frame using percentile

I am following this link to remove outliers, but something seems logically wrong here:

Remove Outliers in Pandas DataFrame using Percentiles

My dataset's first column is "id" and its last column is "label".

Here is my code: I drop the id and label columns, filter the rest by percentile, and then concatenate them back:

def processing_data(train_data, test_data):
    # Compute the percentile bounds.
    low = .05
    high = .95
    filt_df = train_data.loc[:, train_data.columns != 'id']
    filt_df = filt_df.loc[:, filt_df.columns != 'label']
    quant_df = filt_df.quantile([low, high])
    print(quant_df)

    # Filter values based on the computed percentiles, applying column by column.
    print("Before removing outliers", filt_df, filt_df.shape)
    train_data1 = filt_df.apply(
        lambda x: x[(x >= quant_df.loc[low, x.name]) & (x <= quant_df.loc[high, x.name])],
        axis=0)
    print("After removing outliers", train_data1, train_data1.shape)
    print(train_data1.isnull().sum())
    train_data1 = pd.concat([train_data.loc[:, 'id'], train_data1], axis=1)
    train_data = pd.concat([train_data.loc[:, 'label'], train_data1], axis=1)
    #train_data.dropna(inplace=True)

    #train_data.fillna(0)
    #test_data.fillna(0)
    #print(train_data)
    #print(np.isnan(train_data).any().sum())
    return train_data, test_data

Output: every row contains at least one NaN value, so when I run train_data.dropna(inplace=True) all the rows are dropped. Strange!

How can I fix this? I suspect something is wrong in the way I concatenate the id and label columns back after the outlier treatment.

Here is the dataset:

id  feature0      feature1     feature2      feature3      feature4     ...  label
0   25.20824887   -16.7457484  50.86994402   5.593471686   1.188262678  ...  1
1   -86.93144987  0.428227194  2.87483597    -8.064850183  6.056867093  ...  2
2   42.16093367   7.85701304   151.6127571   9.639675583   5.570138511  ...  0
3   20.66694385   8.680641918  -56.44917913  -9.814779803  -2.382979151 ...  1
4   35.9466789    4.57373573   -28.16021186  -6.91297056   4.879375409  ...  0

(columns feature5 through feature249 omitted)

Upvotes: 4

Views: 2442

Answers (1)

user85779

Reputation: 334

When I ran your code on your example I got a ValueError. This pandas issue mentions that quantile behaves erratically on DataFrames with mixed dtypes, either returning NaNs or raising a ValueError: https://github.com/pandas-dev/pandas/issues/14564 . I think the culprit here is the feature249 column, which is int while the rest are floats. After I forced all the columns to float with filt_df = pd.DataFrame(filt_df, dtype=float), the code ran fine.
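A minimal sketch of the cast, using a made-up two-column frame (your real data has 250 features; astype(float) is an equivalent way to force the dtype):

```python
import pandas as pd

# Toy frame mimicking the question: float feature columns plus one
# int-typed column standing in for feature249.
df = pd.DataFrame({
    "feature0": [25.2, -86.9, 42.1, 20.6, 35.9],
    "feature249": [1, 2, 3, 4, 5],  # int dtype, like the problematic column
})

# Force every column to float before computing the quantiles.
df = df.astype(float)
quant_df = df.quantile([0.05, 0.95])
print(quant_df)  # one row of bounds per quantile, one column per feature
```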

The NaNs in each row are what the filter put in place of the values that fell outside your low/high bounds. Every row in the example has at least one value outside your .05/.95 boundaries (your data may be much more spread out than you think), so when you dropna with the default how='any', every row is removed. You can change how dropna operates by passing how='all' or another option, but it is probably better to adjust your upper and lower bounds to match your data's spread. Remember that even though your bounds are fairly exclusive, with each added column it becomes more and more likely that at least one value in each row falls outside them.
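To illustrate the dropna options on toy data (not the asker's frame):

```python
import pandas as pd

# Two columns where every row has at least one NaN, like the filtered output.
df = pd.DataFrame({
    "a": [1.0, None, None],
    "b": [None, 2.0, None],
})

# Default how='any': drop a row if ANY value is NaN -> everything goes.
print(df.dropna())           # empty frame

# how='all': drop a row only when EVERY value in it is NaN.
print(df.dropna(how="all"))  # keeps the first two rows

# thresh: keep rows with at least that many non-NaN values.
print(df.dropna(thresh=1))   # same two rows here
```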

Upvotes: 0
