Reputation: 198
I am following this link to remove outliers, but something is logically wrong here:
Remove Outliers in Pandas DataFrame using Percentiles
I have a dataset with first column as "id" and last column as "label".
Here is my code: I remove the label and id columns, filter the remaining columns, and then append the two columns back:
def processing_data(train_data, test_data):
    # Compute percentiles.
    low = .05
    high = .95
    filt_df = train_data.loc[:, train_data.columns != 'id']
    filt_df = filt_df.loc[:, filt_df.columns != 'label']
    quant_df = filt_df.quantile([low, high])
    print(quant_df)

    # Filter values based on the computed percentiles, using a column-wise apply.
    print("Before removing outliers", filt_df, filt_df.shape)
    train_data1 = filt_df.apply(lambda x: x[(x >= quant_df.loc[low, x.name]) &
                                            (x <= quant_df.loc[high, x.name])], axis=0)
    print("After removing outliers", train_data1, train_data1.shape)
    print(train_data1.isnull().sum())
    train_data1 = pd.concat([train_data.loc[:, 'id'], train_data1], axis=1)
    train_data = pd.concat([train_data.loc[:, 'label'], train_data1], axis=1)
    #train_data.dropna(inplace=True)
    #train_data.fillna(0)
    #test_data.fillna(0)
    #print(train_data)
    #print(np.isnan(train_data).any().sum())
    return train_data, test_data
Output: every row contains at least one NaN value, and when I do train_data.dropna(inplace=True) all the rows are dropped. Strange!
How can I fix this? I suspect something is wrong with the way I concat the id and label columns back after the outlier treatment.
Here is the dataset:
id feature0 feature1 feature2 feature3 feature4 feature249 label
0 25.20824887 -16.7457484 50.86994402 5.593471686 1.188262678 1
1 -86.93144987 0.428227194 2.87483597 -8.064850183 6.056867093 2
2 42.16093367 7.85701304 151.6127571 9.639675583 5.570138511 0
3 20.66694385 8.680641918 -56.44917913 -9.814779803 -2.382979151 1
4 35.9466789 4.57373573 -28.16021186 -6.91297056 4.879375409 0
Upvotes: 4
Views: 2442
Reputation: 334
When I ran your code with your example I got a ValueError. I found an issue which mentions that quantile behaves erratically on DataFrames with mixed float/int dtypes, either returning NaNs or raising a ValueError: https://github.com/pandas-dev/pandas/issues/14564. I think the culprit here is the feature249 column, which is int while the rest are floats. When I used filt_df = pd.DataFrame(filt_df, dtype=float) to force all the columns to floats, it ran fine.
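A minimal sketch of the dtype fix (the column names and values are invented for illustration; astype(float) is an equivalent way to force the cast):

```python
import pandas as pd

# Mixed dtypes: feature249 is int while the other column is float, mirroring
# the situation described above.
filt_df = pd.DataFrame({
    "feature0": [25.2, -86.9, 42.2, 20.7, 35.9],
    "feature249": [1, 2, 0, 1, 0],
})

filt_df = filt_df.astype(float)           # force every column to float
quant_df = filt_df.quantile([0.05, 0.95]) # now runs without the dtype issue
print(quant_df)
```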
The NaNs in each row are what the filter puts in place of values that fall outside your low/high bounds. In your example, every row has at least one value outside the .05/.95 boundaries (your data may be much more spread out than you think). This means that when you call dropna with its default how='any', all rows are removed. You can change how dropna operates by passing how='all' or another option such as thresh. It is probably better, though, to adjust your upper/lower bounds to match your data's actual spread. Remember: even though your bounds are fairly exclusive for each column, with each added column it becomes more and more likely that at least one value in each row falls outside those bounds.
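A quick way to see the difference between the dropna options (toy values, not your dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the masked result: after the percentile filter,
# most rows hold at least one NaN.
masked = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 2.0, 4.0],
})

print(masked.dropna(how="any"))  # drops every row containing a NaN -> 1 row left
print(masked.dropna(how="all"))  # drops only all-NaN rows -> all 3 rows kept
print(masked.dropna(thresh=2))   # keeps rows with at least 2 non-NaN values
```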
Upvotes: 0