Reputation: 301
Is there a way to write a Python function that takes a two-dimensional NumPy array, finds the indices of any outliers, and returns an array of those indices?
This is what I have so far. I tried using the Z-score method:
import numpy as np

def function(arrayMatrix):
    threshold = 3
    mean_y = np.mean(arrayMatrix)
    stdev_y = np.std(arrayMatrix)
    z_scores = [(y - mean_y) / stdev_y for y in arrayMatrix]
    return np.where(np.abs(z_scores) > threshold)

def main():
    MatrixOne = np.array([[1,2,10],[1,10,2]])
    print(function(MatrixOne))
    MatrixTwo = np.array([[1,2,3,4,20],[1,20,2,3,4],[20,2,3,4,5]])
    print(function(MatrixTwo))

main()
The expected results are:
[2 1]
[4 1 0]
My results are:
(array([], dtype=int32), array([], dtype=int32))
(array([], dtype=int32), array([], dtype=int32))
Upvotes: 1
Views: 2602
Reputation: 226
An outlier is commonly defined as a measured value that deviates from the mean by more than two standard deviations or, under a stricter rule, by more than three. In your case, you could treat any value whose deviation from its row mean exceeds one standard deviation as an outlier.
Try this:
import numpy as np

def main():
    MatrixOne = np.array([[1,2,10],[1,10,2]])
    print(function(MatrixOne))
    MatrixTwo = np.array([[1,2,3,4,20],[1,20,2,3,4],[20,2,3,4,5]])
    print(function(MatrixTwo))
    MatrixThree = np.array([[1,10,2,8,5],[2,7,3,9,11],[19,2,1,1,5]])
    print(function(MatrixThree))

def function(arrayMatrix):
    arraystd = np.std(arrayMatrix, 1, ddof=1, keepdims=True)
    arraymean = np.mean(arrayMatrix, 1)[:, np.newaxis]
    arrayoutlier = np.transpose(np.where(np.abs(arrayMatrix - arraymean) > arraystd))  # or 2*arraystd
    return arrayoutlier

main()
Output:
[[0 2]
 [1 1]]
[[0 4]
 [1 1]
 [2 0]]
[[0 0]
 [0 1]
 [1 0]
 [1 4]
 [2 0]]
Each row of the returned array is the (row, column) coordinate of one outlier.
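If only the column index of each outlier is wanted, as in the question's expected output, the second column of the coordinate array can be sliced off. Here is a minimal sketch, assuming the same per-row rule as above (deviation from the row mean greater than one sample standard deviation). The names row_outlier_coords and k are made up for illustration and are not part of the code above.

```python
import numpy as np

def row_outlier_coords(matrix, k=1.0):
    """Return (row, column) pairs whose deviation from the row mean
    exceeds k times the row's sample standard deviation."""
    row_mean = matrix.mean(axis=1, keepdims=True)
    row_std = matrix.std(axis=1, ddof=1, keepdims=True)  # ddof=1: sample std
    # np.argwhere gives the same pairs as np.transpose(np.where(...))
    return np.argwhere(np.abs(matrix - row_mean) > k * row_std)

matrix_two = np.array([[1, 2, 3, 4, 20], [1, 20, 2, 3, 4], [20, 2, 3, 4, 5]])
coords = row_outlier_coords(matrix_two)
print(coords[:, 1])  # column indices only: [4 1 0]
```

Raising k (for example to 2) makes the rule stricter, which mirrors the "or 2*arraystd" comment in the code above.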
Upvotes: 0
Reputation: 3043
You have asked a very good question. You can use the interquartile range (IQR) method to detect outliers in Python. =)
Check this code out. You can adjust the variable named outlierConstant to increase (or decrease) your tolerance for outliers. I have chosen outlierConstant=0.5 for the example that I am giving here.
import numpy as np

# iqr is a function which returns indices of outliers in each row/1d array
def iqr(a, outlierConstant):
    """
    a : numpy.ndarray (array from which outliers have to be removed.)
    outlierConstant : (scale factor around interquartile region.)
    """
    num = a.shape[0]
    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)
    outlier_indx = []
    for i in range(num):
        if a[i] >= quartileSet[0] and a[i] <= quartileSet[1]: pass
        else: outlier_indx += [i]
    return outlier_indx

def function(arr):
    lst = []
    for i in range(arr.shape[0]):
        lst += iqr(a = arr[i,:], outlierConstant=0.5)
    return lst

def main():
    MatrixOne = np.array([[1,2,10],[1,10,2]])
    print(function(MatrixOne))
    MatrixTwo = np.array([[1,2,3,4,20],[1,20,2,3,4],[20,2,3,4,5]])
    print(function(MatrixTwo))

main()
Output:
[2, 1]
[4, 1, 0]
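The same per-row IQR rule can probably be written without Python loops by asking np.percentile for both quartiles along axis 1. This is only a sketch under that assumption; the name iqr_outlier_columns is hypothetical, and it is meant to return the same column indices in the same order as the loop version.

```python
import numpy as np

def iqr_outlier_columns(arr, outlier_constant=0.5):
    """Column indices of per-row IQR outliers, listed row by row."""
    # Shape (2, n_rows, 1): 25th and 75th percentile of every row
    lower_q, upper_q = np.percentile(arr, [25, 75], axis=1, keepdims=True)
    margin = (upper_q - lower_q) * outlier_constant
    # A value is an outlier if it falls outside the widened quartile range
    mask = (arr < lower_q - margin) | (arr > upper_q + margin)
    _, cols = np.nonzero(mask)  # row-major order, so rows stay grouped
    return cols.tolist()

matrix_one = np.array([[1, 2, 10], [1, 10, 2]])
matrix_two = np.array([[1, 2, 3, 4, 20], [1, 20, 2, 3, 4], [20, 2, 3, 4, 5]])
print(iqr_outlier_columns(matrix_one))  # [2, 1]
print(iqr_outlier_columns(matrix_two))  # [4, 1, 0]
```

Note that a row with several outliers contributes several entries, so the list length need not equal the number of rows.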
Upvotes: 3
Reputation: 13999
Your math is fine (though you'll need to set threshold=1 to get the result you want), but your use of NumPy arrays is a little off. Here is how you can fix your code:
import numpy as np

def function(arrayMatrix, threshold=1):
    zscore = (arrayMatrix - arrayMatrix.mean())/arrayMatrix.std()
    return np.where(np.abs(zscore) > threshold)

def main():
    MatrixOne = np.array([[1,2,10],[1,10,2]])
    print(function(MatrixOne))
    MatrixTwo = np.array([[1,2,3,4,20],[1,20,2,3,4],[20,2,3,4,5]])
    print(function(MatrixTwo))
    MatrixThree = np.array([[1,10,2,8,5],[2,7,3,9,11],[19,2,1,1,5]])
    print(function(MatrixThree))

main()
This outputs:
(array([0, 1]), array([2, 1]))
(array([0, 1, 2]), array([4, 1, 0]))
(array([1, 2]), array([4, 0]))
The first array in each line holds the row indices of the outliers, and the second array holds the column indices. So, for example, the first line of the output tells you that the outliers in MatrixOne are at:
outliers = [MatrixOne[0,2], MatrixOne[1,1]]
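The tuple returned by np.where can also be passed straight back as an index, and np.argwhere gives the same positions as (row, column) pairs. The sketch below also prints the largest absolute z-score in MatrixOne, which suggests why a threshold of 3 flags nothing for such a small matrix. It assumes the whole-matrix z-score used above; the helper name zscores is illustrative only.

```python
import numpy as np

def zscores(matrix):
    """Z-scores relative to the mean and std of the whole matrix."""
    return (matrix - matrix.mean()) / matrix.std()

matrix_one = np.array([[1, 2, 10], [1, 10, 2]])
z = np.abs(zscores(matrix_one))

# Largest |z| here is roughly 1.41, so threshold=3 can never be exceeded
print(z.max())

idx = np.where(z > 1)
print(matrix_one[idx])     # outlier values: [10 10]
print(np.argwhere(z > 1))  # coordinate pairs: (0, 2) and (1, 1)
```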
Upvotes: 0