Adam

Reputation: 429

How do you index outliers in Python?

I am trying to remove outliers from a list in Python. I want to get the index of each outlier in the original list so I can remove the element at the same index from another, corresponding list.

Simple example

my list with outliers:

y = [1, 2, 3, 4, 500]  # 500 is the outlier; it has an index of 4

my corresponding list:

x = [1, 2, 3, 4, 5]  # I want to remove 5, which has the same index of 4

MY RESULT/GOAL:

y=[1,2,3,4]

x=[1,2,3,4]

This is my code, and I want to achieve the same with klist and avglatlist

import numpy as np

klist=['1','2','3','4','5','6','7','8','4000']
avglatlist=['1','2','3','4','5','6','7','8','9']


klist = np.array(klist).astype(float)
klist=klist[(abs(klist - np.mean(klist))) < (2 * np.std(klist))]

indices=[]
for k in klist:
    if (k-np.mean(klist))>((2*np.std(klist))):
        i=klist.index(k)
        indices.append(i)

print('indices'+str(indices))

avglatlist = np.array(avglatlist).astype(float)


for index in sorted(indices, reverse=True):
    del avglatlist[index]


print(len(klist))
print(len(avglatlist))

Upvotes: 1

Views: 2250

Answers (2)

Jarad

Reputation: 18913

How to get the index values of each outlier in a list?

Say an outlier is defined as a value more than 2 standard deviations from the mean. This means you want the indices of the values in the list whose z-scores have absolute values greater than 2.

I would use np.where:

import numpy as np
from scipy.stats import zscore

klist = np.array([1, 2, 3, 4, 5, 6, 7, 8, 4000])
avglatlist = np.arange(1, klist.shape[0] + 1)

# Indices of the outliers (|z-score| > 2)
indices = np.where(np.absolute(zscore(klist)) > 2)[0]
# Indices of every element that is not an outlier
indices_filter = [i for i in range(len(klist)) if i not in indices]
print(avglatlist[indices_filter])

If you don't actually need to know the indices, use a boolean mask instead:

import numpy as np
from scipy.stats import zscore

klist = np.array([1, 2, 3, 4, 5, 6, 7, 8, 4000])
avglatlist = np.arange(1, klist.shape[0] + 1)

mask = np.absolute(zscore(klist)) > 2
print(avglatlist[~mask])

Both solutions print:

[1 2 3 4 5 6 7 8]
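
If SciPy is not available, the z-score can be computed by hand as (x - mean) / std, and np.delete can drop entries by index. This is only a minimal sketch under the same 2-standard-deviation assumption; the helper name drop_outliers_by_index and its threshold parameter are made up for illustration:

```python
import numpy as np

def drop_outliers_by_index(values, paired, threshold=2.0):
    """Return (outlier_indices, values_kept, paired_kept).

    Hypothetical helper: `values` decides what counts as an outlier,
    and `paired` is the corresponding array that gets the same removals.
    """
    values = np.asarray(values, dtype=float)
    paired = np.asarray(paired)
    # Population z-score (ddof=0), the same default scipy.stats.zscore uses.
    # Note: if every value is identical, std is 0 and this yields NaN.
    z = (values - values.mean()) / values.std()
    outlier_indices = np.where(np.abs(z) > threshold)[0]
    values_kept = np.delete(values, outlier_indices)
    paired_kept = np.delete(paired, outlier_indices)
    return outlier_indices, values_kept, paired_kept

klist = [1, 2, 3, 4, 5, 6, 7, 8, 4000]
avglatlist = [1, 2, 3, 4, 5, 6, 7, 8, 9]

indices, klist_kept, avglatlist_kept = drop_outliers_by_index(klist, avglatlist)
print(indices)          # [8]
print(avglatlist_kept)  # [1 2 3 4 5 6 7 8]
```

np.delete returns a new array rather than modifying its input, so the original klist and avglatlist stay untouched.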

Upvotes: 1

Andrew Guy

Reputation: 9968

You are really close. All you need to do is apply the same filter to a NumPy version of avglatlist. I've changed a few variable names for clarity.

import numpy as np

klist = ['1', '2', '3', '4', '5', '6', '7', '8', '4000']
avglatlist = ['1', '2', '3', '4', '5', '6', '7', '8', '9']


# np.float was removed in NumPy 1.24; the builtin float does the same job
klist_np = np.array(klist).astype(float)
avglatlist_np = np.array(avglatlist).astype(float)

# Build the mask from klist once, then apply it to both arrays
keep = abs(klist_np - np.mean(klist_np)) < (2 * np.std(klist_np))
klist_filtered = klist_np[keep]
avglatlist_filtered = avglatlist_np[keep]
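
If you would rather keep the original Python lists and delete by index, as the loop in the question tries to do, the key is to find the outlier indices before anything is filtered. A rough sketch, assuming both lists have the same length and hold numeric strings as in the question (the names k, is_outlier and indices are just for illustration):

```python
import numpy as np

klist = ['1', '2', '3', '4', '5', '6', '7', '8', '4000']
avglatlist = ['1', '2', '3', '4', '5', '6', '7', '8', '9']

# Numeric copy used only to decide which positions are outliers
k = np.array(klist).astype(float)

# Find the outlier positions BEFORE removing anything from either list
is_outlier = np.abs(k - k.mean()) >= 2 * k.std()
indices = np.flatnonzero(is_outlier).tolist()

# Delete from the highest index down so earlier deletions
# do not shift the positions of the remaining ones
for index in sorted(indices, reverse=True):
    del klist[index]
    del avglatlist[index]

print(indices)     # [8]
print(klist)       # ['1', '2', '3', '4', '5', '6', '7', '8']
print(avglatlist)  # ['1', '2', '3', '4', '5', '6', '7', '8']
```

The del statement works here because klist and avglatlist are still plain lists; NumPy arrays do not support item deletion with del, which is why the array versions use a boolean mask instead.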

Upvotes: 0
