DreamerP

Reputation: 198

Removing outliers from pandas data frame using percentile

I am following this link to remove outliers, but something seems logically wrong here:

Remove Outliers in Pandas DataFrame using Percentiles

My dataset's first column is "id" and its last column is "label".

Here is my code: I drop the id and label columns, filter the rest by percentile, and then concatenate them back:

def processing_data(train_data, test_data):
    # Compute the percentile bounds.
    low = .05
    high = .95
    filt_df = train_data.loc[:, train_data.columns != 'id']
    filt_df = filt_df.loc[:, filt_df.columns != 'label']
    quant_df = filt_df.quantile([low, high])
    print(quant_df)

    # Filter values based on the computed percentiles, applying column by column.
    print("Before removing outliers", filt_df, filt_df.shape)
    train_data1 = filt_df.apply(
        lambda x: x[(x >= quant_df.loc[low, x.name]) & (x <= quant_df.loc[high, x.name])],
        axis=0)
    print("After removing outliers", train_data1, train_data1.shape)
    print(train_data1.isnull().sum())
    train_data1 = pd.concat([train_data.loc[:, 'id'], train_data1], axis=1)
    train_data = pd.concat([train_data.loc[:, 'label'], train_data1], axis=1)
    #train_data.dropna(inplace=True)

    #train_data.fillna(0)
    #test_data.fillna(0)
    #print(train_data)
    #print(np.isnan(train_data).any().sum())
    return train_data, test_data

Output: every row contains at least one NaN value, so when I run train_data.dropna(inplace=True) all the rows are dropped. Strange!

How can I fix this? I suspect something is wrong in the way I concatenate the id and label columns back after the outlier treatment.

Here is the dataset:

id  feature0      feature1     feature2      feature3      feature4     ...  label
0   25.20824887   -16.7457484  50.86994402   5.593471686   1.188262678  ...  1
1   -86.93144987  0.428227194  2.87483597    -8.064850183  6.056867093  ...  2
2   42.16093367   7.85701304   151.6127571   9.639675583   5.570138511  ...  0
3   20.66694385   8.680641918  -56.44917913  -9.814779803  -2.382979151 ...  1
4   35.9466789    4.57373573   -28.16021186  -6.91297056   4.879375409  ...  0

(columns feature5 through feature249 omitted)

Upvotes: 4

Views: 2442

Answers (1)

user85779

Reputation: 334

When I ran your code on your example I got a ValueError. This pandas issue mentions that quantile behaves erratically on DataFrames with mixed dtypes, either returning NaNs or raising a ValueError: https://github.com/pandas-dev/pandas/issues/14564 . I think the culprit here is the feature249 column, which is int while the rest are floats. After I forced all the columns to float with filt_df = pd.DataFrame(filt_df, dtype=float), the code ran fine.
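A minimal sketch of the cast, using a made-up two-column frame (your real data has 250 features; astype(float) is an equivalent way to force the dtype):

```python
import pandas as pd

# Toy frame mimicking the question: float feature columns plus one
# int-typed column standing in for feature249.
df = pd.DataFrame({
    "feature0": [25.2, -86.9, 42.1, 20.6, 35.9],
    "feature249": [1, 2, 3, 4, 5],  # int dtype, like the problematic column
})

# Force every column to float before computing the quantiles.
df = df.astype(float)
quant_df = df.quantile([0.05, 0.95])
print(quant_df)  # one row of bounds per quantile, one column per feature
```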

The NaNs in each row are what the filter put in place of the values that fell outside your low/high bounds. Every row in the example has at least one value outside your .05/.95 boundaries (your data may be much more spread out than you think), so when you dropna with the default how='any', every row is removed. You can change how dropna operates by passing how='all' or another option, but it is probably better to adjust your upper and lower bounds to match your data's spread. Remember that even though your bounds are fairly exclusive, with each added column it becomes more and more likely that at least one value in each row falls outside them.
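To illustrate the dropna options on toy data (not the asker's frame):

```python
import pandas as pd

# Two columns where every row has at least one NaN, like the filtered output.
df = pd.DataFrame({
    "a": [1.0, None, None],
    "b": [None, 2.0, None],
})

# Default how='any': drop a row if ANY value is NaN -> everything goes.
print(df.dropna())           # empty frame

# how='all': drop a row only when EVERY value in it is NaN.
print(df.dropna(how="all"))  # keeps the first two rows

# thresh: keep rows with at least that many non-NaN values.
print(df.dropna(thresh=1))   # same two rows here
```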

Upvotes: 0
