Santiago Ramirez

Reputation: 39

Erasing outliers from a dataframe in python

For an assignment I have to remove the outliers from a CSV file using several different methods.

I loaded the CSV into a pandas DataFrame and tried to work with the 'height' column, but my code either raises errors or leaves the outliers untouched. I am trying to use the KNN method in Python.

The code that I wrote is the following:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs


df = pd.read_csv("data.csv")

print(df.describe())
print(df.columns)

df['height'].plot(kind='hist')
print(df['height'].value_counts())

data= pd.DataFrame(df['height'],df['active'])

k=1
knn = NearestNeighbors(n_neighbors=k)
knn.fit([df['height']])
neighbors_and_distances = knn.kneighbors([df['height']])
knn_distances = neighbors_and_distances[0]
tnn_distance = np.mean(knn_distances, axis=1)
print(knn_distances)
PCM = df.plot(kind='scatter', x='x', y='y', c=tnn_distance, colormap='viridis')
plt.show()

And the data looks something like this:

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,18857,1,50,64.0,130,70,3,1,0,0,0,1
3,17623,2,250,82.0,150,100,1,1,0,0,1,1

I don't know what I'm missing or doing wrong.

Upvotes: 0

Views: 93

Answers (1)

Mateusz Dorobek

Reputation: 761

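import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("data.csv")
# Copy the two columns so a new column can be added later without a SettingWithCopyWarning
X = df[['height', 'weight']].copy()
X.plot(kind='scatter', x='weight', y='height', colormap='viridis')
plt.show()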


knn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = knn.kneighbors(X)
X['distances'] = distances[:,1]
X.distances
0       1.000000
1       1.000000
2       1.000000
3       3.000000
4       1.000000
5       1.000000
6     133.958949
7     100.344407
       ...
X.plot(kind='scatter', x='weight', y='height', c='distances', colormap='viridis')
plt.show()


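MAX_DIST = 10
# Use the 1-D 'distances' column as the mask; the raw 2-D `distances` array cannot index rows
X.loc[X.distances < MAX_DIST, ['height', 'weight']]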
   height  weight
0     162    78.0
1     162    78.0
2     151    76.0
3     151    76.0
4     171    84.0
...

And finally, to filter out all the outliers:

MAX_DIST = 10
X = X[X.distances < MAX_DIST]
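
As for why the code in the question fails: knn.fit([df['height']]) wraps the whole column in a list, so scikit-learn sees a single sample with one feature per row instead of one sample per row. Selecting the column with double brackets (or reshaping to (-1, 1)) gives the expected 2-D shape. Below is a minimal sketch that wraps the steps above into one function and also works for a single column such as 'height'. The function name drop_knn_outliers, the default k and max_dist, and the height values in the usage example are my own made-up illustrations, not something taken from the asker's file, and the threshold would need tuning on real data.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def drop_knn_outliers(df, columns, k=2, max_dist=10.0):
    """Keep rows whose mean distance to their k nearest other rows is below max_dist.

    Hypothetical helper for illustration; k and max_dist are assumed defaults.
    """
    # Selecting with a list of columns yields a 2-D array: one row per sample.
    features = df[columns].to_numpy(dtype=float)
    # Ask for k + 1 neighbours because each row is returned as its own
    # nearest neighbour at distance 0.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    distances, _ = knn.kneighbors(features)
    # Skip the first column (the zero self-distance) and average the rest.
    score = distances[:, 1:].mean(axis=1)
    return df[score < max_dist]


# Made-up heights: 50 and 250 are far from every other value.
sample = pd.DataFrame({"height": [168, 156, 50, 250, 165, 158, 170, 160]})
clean = drop_knn_outliers(sample, ["height"], k=2, max_dist=10.0)
print(clean)
```

When several columns are used together (for example height and weight), the distances mix different units, so scaling the columns first may be worth considering.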

Upvotes: 1
