Angel Lira

Reputation: 413

Removal of outliers using numpy.argwhere

Hey guys, this question might be more about logic than code; hopefully someone can shed some light on it.
So, I have a data list that contains some outliers, and I want to remove them by taking the difference between consecutive items of the list and identifying where that difference is far too big.
In this example, I want to remove indexes [2, 3, 4] from the data list. What is the best way to do it?
I have tried using np.argwhere() to find the indexes, but I am stuck on how to use its result to slice a np.array.

import numpy as np

data = [4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5]
data = np.array(data)
d = data[:-1] - data[1:]   # difference between each item and the next one
print(np.mean(d))

In this example, when I print the difference (d) it returns me this:

print(d) # returns:[ -0.5 -18.  -18.   18.   19.    0.5  -0.5  -1.    1.    1. ]

That is good. Now, the logic I applied was to find where d has a value higher than the mean of the original data.

x = np.argwhere(d>np.mean(data))
print(x)        # returns: array([3], dtype=int64), array([4], dtype=int64)
indices_to_extract = [x[0]-1,x[-1]]
print(indices_to_extract)      # returns: [array([2], dtype=int64), array([[4]], dtype=int64)]
a1 = np.delete(data, indices_to_extract, axis=0)
print(a1)       #returns: [ 4.   4.5 40.5  3.5  3.   3.5  4.5  3.5  2.5]


# Desired return:
[ 4.   4.5 3.5  3.  3.5  4.5  3.5  2.5]

The main question is: how can I turn the result of np.argwhere() into a range of indexes that can be used for slicing?
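For reference, a minimal sketch of one way to do this, keeping the same bounds as the attempt above (first hit minus one, up to the last hit). It assumes the outliers form a single contiguous run; the names `indices_to_remove` and `cleaned` are made up for illustration.

```python
import numpy as np

data = np.array([4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5])
d = data[:-1] - data[1:]

# argwhere returns a 2-D array of shape (n_hits, 1); ravel() flattens it to 1-D
x = np.argwhere(d > np.mean(data)).ravel()          # array([3, 4])

# Expand the two bounds into every index in between (end is exclusive, hence +1)
indices_to_remove = np.arange(x[0] - 1, x[-1] + 1)  # array([2, 3, 4])

cleaned = np.delete(data, indices_to_remove)
print(cleaned)  # [4.  4.5 3.5 3.  3.5 4.5 3.5 2.5]
```

The `x[0] - 1` offset only works here because the signed difference catches the drops back to normal but not the initial rise, so this is tied to the shape of this particular data set.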

Upvotes: 1

Views: 400

Answers (3)

Ahana Kk

Reputation: 1

To use np.argwhere() for a range of values, say between 3 and 20 in your case, you can use:

x = np.argwhere((data<20) & (data>3))

To return the elements less/greater than a number (say, data below 20), you can simply use:

data[np.where(data<20)]

and for a range of values, say between 3 and 20:

data[np.where((data<20)&(data>3))]
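As a side note, a small sketch (using the question's data) suggesting that the np.where() call is optional here, since a boolean mask can index the array directly. The variable names are only illustrative. Note that both comparisons are strict, so values equal to 3 or 20 are dropped too.

```python
import numpy as np

data = np.array([4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5])

# Boolean mask: True where the value lies strictly between 3 and 20
mask = (data > 3) & (data < 20)

via_where = data[np.where(mask)]  # index with the positions of the True entries
via_mask = data[mask]             # index with the mask itself

print(np.array_equal(via_where, via_mask))  # True
print(via_mask)  # [4.  4.5 3.5 3.5 4.5 3.5]
```

With these bounds the values 3.0 and 2.5 are filtered out as well, which is more than the question asked for; a threshold on the values themselves has to be picked to fit the data.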

Upvotes: 0

Ehsan

Reputation: 12397

I would advise using normalized distances to the median, which is more robust:

d = np.abs(data - np.median(data))
mdev = np.median(d)
s = d / (mdev if mdev else 1.)
print(data[s < 4])

You can change the threshold (here 4 in the last line) to your desired accuracy.

output:

[4.  4.5 3.5 3.  3.5 4.5 3.5 2.5]
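If you need this in more than one place, the same steps can be wrapped in a small helper. This is just a sketch: the function name `remove_outliers_mad` and the `threshold` parameter are made up for illustration, and the logic is the snippet above unchanged.

```python
import numpy as np

def remove_outliers_mad(values, threshold=4.0):
    """Keep values whose normalized distance to the median is below threshold."""
    values = np.asarray(values, dtype=float)
    d = np.abs(values - np.median(values))   # absolute distance to the median
    mdev = np.median(d)                      # median of those distances
    s = d / (mdev if mdev else 1.0)          # avoid dividing by zero
    return values[s < threshold]

data = [4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5]
kept = remove_outliers_mad(data)
stricter = remove_outliers_mad(data, threshold=2.0)
print(kept)      # [4.  4.5 3.5 3.  3.5 4.5 3.5 2.5]
print(stricter)  # [4.  4.5 3.5 3.5 4.5 3.5]
```

A lower threshold removes more points: with 2.0 the values 3.0 and 2.5 are dropped as well.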

Upvotes: 1

DavideBrex

Reputation: 2414

The problem with taking the difference between items of the list is that, for instance, the value at index 1 (4.5) will be considered an outlier (it gets a high value in the difference). Also, you can get both positive and negative values when taking the difference, so if you want to do it that way, you should take the absolute value (abs) of the difference.
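If you still want to go the difference route, here is a rough sketch with abs applied. It assumes the outliers sit in one contiguous block that starts right after the first big jump and ends at the last one; that assumption holds for the question's data but not in general. The names `jumps` and `outlier_idx` are made up.

```python
import numpy as np

data = np.array([4.0, 4.5, 22.5, 40.5, 22.5, 3.5, 3.0, 3.5, 4.5, 3.5, 2.5])

# Absolute difference between neighbours: entry i compares data[i] and data[i + 1]
d = np.abs(np.diff(data))

# Positions in d where the jump exceeds the mean of the data
jumps = np.flatnonzero(d > np.mean(data))            # array([1, 2, 3, 4])

# The first jump leads INTO the block (hence +1); the last jump leads OUT of it
outlier_idx = np.arange(jumps[0] + 1, jumps[-1] + 1)  # array([2, 3, 4])

print(np.delete(data, outlier_idx))  # [4.  4.5 3.5 3.  3.5 4.5 3.5 2.5]
```

Note that the jump at position 1 is exactly the index-1 issue described above: the difference flags the step from 4.5 to 22.5, so the +1 offset is needed to avoid removing 4.5 itself.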

A way to spot outliers is the following:

Compute the z-score:

d = (data - np.mean(data)) / np.std(data)

Select every value from data except for the outliers (above the 75% quantile):

data[np.where( ~(d > np.quantile(d, 0.75)))]

Output:

array([4. , 4.5, 3.5, 3. , 3.5, 4.5, 3.5, 2.5])

Upvotes: 1
