kkk

Reputation: 1920

How to find outliers in a given dataset using Python

coordinates = [(259, 168), (62, 133), (143, 163), (174, 270), (321, 385)]

slope = 0.76083799
intercept = 77.87127406

[Image: scatter plot of the coordinates with the fitted regression line]

The coordinate with the brown marker, (259, 168), looks like an outlier to me and needs to be removed. I am currently trying to use the studentized residual and the jackknife residual to detect such outliers, but I am not able to calculate these residuals for the dataset I have.

It would be really helpful if someone could show me how to compute these residuals for the dataset above.

CODE

import numpy as np
import matplotlib.pyplot as plt

coordinates = [(259, 168), (62, 133), (143, 163), (174, 270), (321, 385)]

x=[x1[0] for x1 in coordinates]
y=[x1[1] for x1 in coordinates]

for x1,y1 in coordinates:
   plt.plot(x1,y1,marker="o",color="brown")
plt.show()

# using numpy polyfit method to find regression line slope and intercept 
z = np.polyfit(x,y,1)
print(z)
slope = z[0]
intercept =z[1]

newx = np.linspace(62,321,200)
newy = np.poly1d(z)
plt.plot(x,y, 'o', newx, newy(newx),color="black")
# plt.plot()
plt.plot(259,168,marker="o",color="brown")
plt.show()

#TODO
#remove the outliers and then display

Upvotes: 0

Views: 2185

Answers (1)

Yarnspinner

Reputation: 892

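x and y are converted to np.ndarrays at the start so that the residuals can be computed with element-wise arithmetic. Note that this uses the plain absolute residual with an arbitrary cutoff.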

Input:

import numpy as np
import matplotlib.pyplot as plt

coordinates = [(259, 168), (62, 133), (143, 163), (174, 270), (321, 385)]

x=np.array([x1[0] for x1 in coordinates]) #Placed into array
y=np.array([x1[1] for x1 in coordinates]) #Placed into array

for x1,y1 in coordinates:
   plt.plot(x1,y1,marker="o",color="brown")
plt.show()

# using numpy polyfit method to find regression line slope and intercept 
z = np.polyfit(x,y,1)
print(z)
slope = z[0]
intercept =z[1]

newx = np.linspace(62,321,200)
newy = np.poly1d(z)
plt.plot(x,y, 'o', newx, newy(newx),color="black")
# plt.plot()
plt.plot(259,168,marker="o",color="brown")
plt.show()

Additional code:

print("old y: " + repr(y)) #Display original array of y values
print("old x: " + repr(x)) 
residual_array = np.abs(y - (intercept + slope * x)) #Absolute residual of each point from the fitted line
max_accept_deviation = 100 #An arbitrary value of "acceptable deviation"
mask = residual_array >= max_accept_deviation #Boolean array: True where the residual reaches or exceeds the cutoff
cleaned_y = y[~mask] #Keep only the points whose residual is below the cutoff
cleaned_x = x[~mask]
print("new y: " + repr(cleaned_y)) #Print the cleaned values
print("new x: " + repr(cleaned_x))

Output:

[  0.76083799  77.87127406]
old y: array([168, 133, 163, 270, 385])
old x: array([259,  62, 143, 174, 321])
new y: array([133, 163, 270, 385])
new x: array([ 62, 143, 174, 321])
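
The question asks about studentized and jackknife residuals specifically, while the cutoff above works on raw residuals. As a minimal sketch, assuming a simple linear regression with two fitted parameters (slope and intercept), both can be computed with plain NumPy from the leverage of each point. The names leverage, student, jackknife and the cutoff of 3 are illustrative choices of mine, not something from the original post.

```python
import numpy as np

coordinates = [(259, 168), (62, 133), (143, 163), (174, 270), (321, 385)]
x = np.array([c[0] for c in coordinates], dtype=float)
y = np.array([c[1] for c in coordinates], dtype=float)

n = len(x)
p = 2  # fitted parameters: slope and intercept

# Ordinary least-squares fit and raw residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Leverage of each point for simple linear regression:
# h_i = 1/n + (x_i - mean(x))^2 / sum((x_j - mean(x))^2)
dev = x - x.mean()
leverage = 1.0 / n + dev ** 2 / np.sum(dev ** 2)

# Internally studentized residual: residual scaled by its estimated standard error
mse = np.sum(residuals ** 2) / (n - p)
student = residuals / np.sqrt(mse * (1.0 - leverage))

# Externally studentized (jackknife) residual, obtained from the internal one
# without refitting the line n times
jackknife = student * np.sqrt((n - p - 1) / (n - p - student ** 2))

# Assumed cutoff: treat |jackknife residual| >= 3 as an outlier
cutoff = 3.0
keep = np.abs(jackknife) < cutoff
cleaned_x = x[keep]
cleaned_y = y[keep]

print("studentized:", student)
print("jackknife:  ", jackknife)
print("cleaned x:  ", cleaned_x)
print("cleaned y:  ", cleaned_y)
```

With only five points the jackknife residual has just n - p - 1 = 2 degrees of freedom, so any fixed cutoff is a rough rule of thumb rather than a formal test. If statsmodels is available, its OLS influence tools report the same quantities and may be more convenient for larger models.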

Upvotes: 1
