Kardinol

Reputation: 301

Python Function to return Index of outlier values in two-dimensional numpy array

Is there a way to write a function in Python where it reads in a numpy two-dimensional array, finds the index values for any outliers, and then returns an array with those index values?

This is what I have so far. I tried using the Z-score method:

import numpy as np

def function(arrayMatrix):
    threshold = 3
    mean_y = np.mean(arrayMatrix)
    stdev_y = np.std(arrayMatrix)
    z_scores = [(y - mean_y) / stdev_y for y in arrayMatrix]
    return np.where(np.abs(z_scores) > threshold)



def main():
    MatrixOne = np.array([[1,2,10],[1,10,2]])   
    print(function(MatrixOne))

    MatrixTwo = np.array([[1,2,3,4,20],[1,20,2,3,4],[20,2,3,4,5]])
    print(function(MatrixTwo))

main()

The expected results would be:

[2 1]
[4 1 0]

My results are:

(array([], dtype=int32), array([], dtype=int32))
(array([], dtype=int32), array([], dtype=int32))

Upvotes: 1

Views: 2602

Answers (3)

myhaspldeep

Reputation: 226

An outlier is usually defined as a value that deviates from the mean by more than two standard deviations or, under a stricter rule, more than three. In your case you could treat any value that deviates from its row mean by more than one standard deviation as an outlier.

Try this:

import numpy as np

def main():
    MatrixOne = np.array([[1,2,10],[1,10,2]])   
    print(function(MatrixOne))

    MatrixTwo = np.array([[1,2,3,4,20],[1,20,2,3,4],[20,2,3,4,5]])
    print(function(MatrixTwo))

    MatrixThree = np.array([[1,10,2,8,5],[2,7,3,9,11],[19,2,1,1,5]]) 
    print(function(MatrixThree))   



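def function(arrayMatrix):
    # Per-row sample standard deviation and mean, kept as column vectors for broadcasting
    arraystd = np.std(arrayMatrix, 1, ddof=1, keepdims=True)
    arraymean = np.mean(arrayMatrix, 1)[:, np.newaxis]
    # (row, column) coordinates of values more than one standard deviation
    # from their row mean; compare against 2 * arraystd for a stricter test
    arrayoutlier = np.transpose(np.where(np.abs(arrayMatrix - arraymean) > arraystd))
    return arrayoutlier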

main()

Output:

[[0 2]
 [1 1]]
[[0 4]
 [1 1]
 [2 0]]
[[0 0]
 [0 1]
 [1 0]
 [1 4]
 [2 0]]

Each row of the returned array is the (row, column) coordinate of one outlier.
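
If you also want the outlier values themselves, the coordinates can be used to index back into the matrix. Here is a minimal sketch, assuming the same per-row rule as above. The names row_outlier_coords and k are made up for illustration, and np.argwhere is used as a shorthand for np.transpose(np.where(...)):

```python
import numpy as np

def row_outlier_coords(matrix, k=1.0):
    """Return (row, col) pairs whose value lies more than k sample
    standard deviations from the mean of its own row."""
    std = np.std(matrix, axis=1, ddof=1, keepdims=True)
    mean = np.mean(matrix, axis=1, keepdims=True)
    return np.argwhere(np.abs(matrix - mean) > k * std)

matrix_one = np.array([[1, 2, 10], [1, 10, 2]])
coords = row_outlier_coords(matrix_one)

# Use the two coordinate columns to pull the flagged values out of the matrix
values = matrix_one[coords[:, 0], coords[:, 1]]

print(coords.tolist())  # [[0, 2], [1, 1]]
print(values.tolist())  # [10, 10]
```

Keep in mind that with only three values per row, no value can sit more than about 1.15 sample standard deviations from its row mean, so k=2 finds nothing in MatrixOne.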

Upvotes: 0

Siddharth Satpathy

Reputation: 3043

You have asked a very good question. You can use the interquartile range (IQR) method to detect outliers in Python. =)

Check this code out. You can adjust the variable named outlierConstant to increase (or decrease) your tolerance for outliers. I have chosen outlierConstant=0.5 for the example that I am giving here.

import numpy as np

# iqr is a function which returns indices of outliers in each row/1d array
def iqr(a, outlierConstant):
    """
    a : numpy.ndarray (1-D array in which outliers are to be found.)
    outlierConstant : (scale factor applied to the interquartile range.)
    """
    num = a.shape[0]

    upper_quartile = np.percentile(a, 75)
    lower_quartile = np.percentile(a, 25)
    IQR = (upper_quartile - lower_quartile) * outlierConstant
    quartileSet = (lower_quartile - IQR, upper_quartile + IQR)

    outlier_indx = []
    for i in range(num):
        # Record the index when the value falls outside the accepted range
        if not (quartileSet[0] <= a[i] <= quartileSet[1]):
            outlier_indx.append(i)

    return outlier_indx  


def function(arr):
    lst = []
    for i in range(arr.shape[0]):
        lst += iqr(a = arr[i,:], outlierConstant=0.5) 
    return lst

def main():
    MatrixOne = np.array([[1,2,10],[1,10,2]])   
    print(function(MatrixOne))

    MatrixTwo = np.array([[1,2,3,4,20],[1,20,2,3,4],[20,2,3,4,5]])
    print(function(MatrixTwo))

main()

Output:

[2, 1]
[4, 1, 0]
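
The same rule can likely be applied to all rows at once, without the Python loops. This is only a sketch under the same assumptions (outliers judged per row, outlierConstant=0.5). The name iqr_outlier_columns is hypothetical, and like the loop version it returns only column indices, so it assumes you do not need to know which row each one came from:

```python
import numpy as np

def iqr_outlier_columns(arr, outlier_constant=0.5):
    """Column indices of per-row IQR outliers, listed in row order."""
    # Quartiles of each row, kept as column vectors so they broadcast over arr
    q1 = np.percentile(arr, 25, axis=1, keepdims=True)
    q3 = np.percentile(arr, 75, axis=1, keepdims=True)
    margin = (q3 - q1) * outlier_constant

    # True wherever a value falls outside [q1 - margin, q3 + margin] for its row
    mask = (arr < q1 - margin) | (arr > q3 + margin)

    # np.nonzero walks the mask row by row, so cols comes out in row order
    rows, cols = np.nonzero(mask)
    return cols.tolist()

matrix_one = np.array([[1, 2, 10], [1, 10, 2]])
matrix_two = np.array([[1, 2, 3, 4, 20], [1, 20, 2, 3, 4], [20, 2, 3, 4, 5]])

print(iqr_outlier_columns(matrix_one))  # [2, 1]
print(iqr_outlier_columns(matrix_two))  # [4, 1, 0]
```

If a row can contain more than one outlier, returning rows alongside cols would be the safer choice.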

Upvotes: 3

tel

Reputation: 13999

Your math is fine (though you'll need to set threshold=1 to get the result you want), but your use of NumPy arrays is a little off. Here's how you can fix your code:

import numpy as np

def function(arrayMatrix, threshold=1):
    zscore = (arrayMatrix - arrayMatrix.mean())/arrayMatrix.std()
    return np.where(np.abs(zscore) > threshold)

def main():
    MatrixOne = np.array([[1,2,10],[1,10,2]])   
    print(function(MatrixOne))

    MatrixTwo = np.array([[1,2,3,4,20],[1,20,2,3,4],[20,2,3,4,5]])
    print(function(MatrixTwo))

    MatrixThree = np.array([[1,10,2,8,5],[2,7,3,9,11],[19,2,1,1,5]])
    print(function(MatrixThree))

main()

This outputs:

(array([0, 1]), array([2, 1]))
(array([0, 1, 2]), array([4, 1, 0]))
(array([1, 2]), array([4, 0]))

The first array in each line holds the row indices of the outliers, and the second array holds the column indices. So, for example, the first line of the output tells you that the outliers in MatrixOne are at:

outliers = [MatrixOne[0,2], MatrixOne[1,1]]
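
If you would rather have explicit (row, column) pairs, the two index arrays can be zipped together. A small sketch, assuming the same whole-matrix z-score as above; zscore_outlier_pairs is just an illustrative name:

```python
import numpy as np

def zscore_outlier_pairs(matrix, threshold=1):
    """(row, col) pairs whose z-score, computed over the whole matrix,
    exceeds threshold in absolute value."""
    zscore = (matrix - matrix.mean()) / matrix.std()
    rows, cols = np.where(np.abs(zscore) > threshold)
    return list(zip(rows.tolist(), cols.tolist()))

matrix_one = np.array([[1, 2, 10], [1, 10, 2]])
matrix_three = np.array([[1, 10, 2, 8, 5], [2, 7, 3, 9, 11], [19, 2, 1, 1, 5]])

print(zscore_outlier_pairs(matrix_one))               # [(0, 2), (1, 1)]
print(zscore_outlier_pairs(matrix_three))             # [(1, 4), (2, 0)]
print(zscore_outlier_pairs(matrix_one, threshold=3))  # []
```

The last call shows why the original threshold of 3 returned empty arrays: the largest absolute z-score in MatrixOne is only about 1.4.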

Upvotes: 0
