user2480542

Reputation: 2945

remove outliers from a 2D list

I have a list of dictionaries like:

t = [{'k': 1, 'a': 22, 'b': 59}, {'k': 2, 'a': 21, 'b': 34}, {'k': 3, 'a': 991, 'b': 29}, {'k': 4, 'a': 45, 'b': 11}, {'k': 5, 'a': 211, 'b': 77}, {'k': 6, 'a': 100, 'b': 1024}]

How do I remove outliers from it, so that I keep only the entries whose values are centered around some meaningful value and drop the ones with values that are far too large or too small?

Thanks.

Upvotes: 1

Views: 4037

Answers (2)

Chris Hagmann

Reputation: 1096

The code below finds the point farthest from the mean, removes it, and recomputes the mean. If removing the point moves the mean by less than a given tolerance (expressed as a fractional change relative to the old mean), the removal is rejected and the previous list is returned. Otherwise, the new list is kept and the process repeats.

t = [{'a': 22, 'b': 59, 'k': 1},
 {'a': 21, 'b': 34, 'k': 2},
 {'a': 991, 'b': 29, 'k': 3},
 {'a': 45, 'b': 11, 'k': 4},
 {'a': 211, 'b': 77, 'k': 5},
 {'a': 100, 'b': 1024, 'k': 6}]

K = [te['k'] for te in t]
A = [te['a'] for te in t]
B = [te['b'] for te in t]

data = zip(K,A,B)

def mean(A):
    return sum(A)/float(len(A))

def max_deviation(A):
    mu = mean(A)
    dev = [(a, abs(a-mu)) for a in A]
    dev.sort(key=lambda k: k[1], reverse=True)
    return dev[0][0]

def remove_outliers(A, tol=.3):
    mu = mean(A)
    worst = max_deviation(A)  # compute once instead of once per element
    A_prime = [a for a in A if a != worst]
    mu_prime = mean(A_prime)
    if abs(mu_prime - mu)/float(mu) > tol:
        return remove_outliers(A_prime, tol)
    else:
        return A

# Filter each column once up front rather than once per row.
A_ok = set(remove_outliers(A))
B_ok = set(remove_outliers(B))
t_prime = [dict(k=k, a=a, b=b) for k, a, b in data
           if a in A_ok and b in B_ok]

>>> print(t_prime)
[{'a': 22, 'b': 59, 'k': 1},
 {'a': 21, 'b': 34, 'k': 2},
 {'a': 45, 'b': 11, 'k': 4}]

EDIT: This version might scale better, since each step removes a single value in place instead of building a new list of N-1 values. Note that it modifies the list you pass in: the rejected values are removed from the original A, and the last candidate is re-appended at the end. If you don't want that, use the first version or pass in a copy.

def remove_outliers(A, tol=.3):
    mu = mean(A)
    out = max_deviation(A)
    A.remove(out)
    mu_prime = mean(A)
    if abs(mu_prime - mu)/float(mu) > tol:
        return remove_outliers(A, tol)
    else:
        A.append(out)
        return A
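As a minimal sketch of the "send in a copy" option, the snippet below restates the helpers compactly (max_deviation is written with max() here, which is my shorthand and not the code from the answer) and passes list(A) so the caller's list is left untouched:

```python
def mean(A):
    return sum(A) / float(len(A))

def max_deviation(A):
    # Value farthest from the mean (first one wins on ties).
    mu = mean(A)
    return max(A, key=lambda a: abs(a - mu))

def remove_outliers(A, tol=.3):
    # In-place variant: mutates the list it is given.
    mu = mean(A)
    out = max_deviation(A)
    A.remove(out)
    mu_prime = mean(A)
    if abs(mu_prime - mu) / float(mu) > tol:
        return remove_outliers(A, tol)
    else:
        A.append(out)  # removal rejected, put the value back
        return A

A = [22, 21, 991, 45, 211, 100]
kept = remove_outliers(list(A))  # list(A) is a shallow copy
print(kept)  # [22, 21, 45]
print(A)     # [22, 21, 991, 45, 211, 100], unchanged
```

Both variants divide by the old mean, so they assume the mean is nonzero and that the list never shrinks to empty.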

Upvotes: 2

perimosocordiae

Reputation: 17847

As a starting point, you can turn your data into a record array:

import numpy as np
t = [{'k': 1, 'a': 22, 'b': 59}, {'k': 2, 'a': 21, 'b': 34}, {'k': 3, 'a': 991, 'b': 29}, {'k': 4, 'a': 45, 'b': 11}, {'k': 5, 'a': 211, 'b': 77}, {'k': 6, 'a': 100, 'b': 1024}]
names = list(t[0])  # ['k', 'a', 'b']; pull values in this order for every row
foo = np.rec.fromrecords([tuple(x[n] for n in names) for x in t], names=names)

This enables some easier analysis:

In [34]: foo.a.mean(), foo.a.std()
Out[34]: (231.66666666666666, 345.81674659018785)

In [35]: foo.b.mean(), foo.b.std()
Out[35]: (205.66666666666666, 366.58590019560518)

Perhaps you could look for outliers with a boxplot?

from matplotlib import pyplot
pyplot.boxplot([foo.a, foo.b])
pyplot.show()
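If you would rather have the boxplot's verdict as numbers than as a picture, here is a rough sketch of the usual whisker rule: a point is flagged when it lies more than 1.5 times the interquartile range beyond the quartiles, which is matplotlib's default whis=1.5. The helper name iqr_inliers and the plain arrays a, b, k are my own choices for illustration, not part of the answer above.

```python
import numpy as np

def iqr_inliers(values, whis=1.5):
    """Boolean mask: True where a value lies inside the whisker fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - whis * iqr) & (values <= q3 + whis * iqr)

k = np.array([1, 2, 3, 4, 5, 6])
a = np.array([22, 21, 991, 45, 211, 100])
b = np.array([59, 34, 29, 11, 77, 1024])

# Keep a row only if both of its values fall inside their column's fences.
mask = iqr_inliers(a) & iqr_inliers(b)
kept_k = k[mask].tolist()
print(kept_k)  # [1, 2, 4, 5]
```

With only six points the quartiles are coarse, so treat the result as a rough guide.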

Or, you could keep only the values below the 90th percentile of the data:

In [40]: foo.a[foo.a < np.percentile(foo.a, 90)]
Out[40]: array([ 22,  21,  45, 211, 100])

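And to select the k values of rows where neither a nor b is an outlier:

# True for rows to keep, i.e. both values are below their 90th percentile
keep_mask = (foo.a < np.percentile(foo.a, 90)) & (foo.b < np.percentile(foo.b, 90))
foo.k[keep_mask]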
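To get back to the original list-of-dicts shape, one option is to build the same percentile mask from plain arrays and filter t with it. This is only a sketch; the names keep and t_filtered are mine.

```python
import numpy as np

t = [{'k': 1, 'a': 22, 'b': 59}, {'k': 2, 'a': 21, 'b': 34},
     {'k': 3, 'a': 991, 'b': 29}, {'k': 4, 'a': 45, 'b': 11},
     {'k': 5, 'a': 211, 'b': 77}, {'k': 6, 'a': 100, 'b': 1024}]

a = np.array([row['a'] for row in t])
b = np.array([row['b'] for row in t])

# Same rule as above: keep rows where both values are below the 90th percentile.
keep = (a < np.percentile(a, 90)) & (b < np.percentile(b, 90))

# The mask is aligned with t, so zip pairs each dict with its keep flag.
t_filtered = [row for row, ok in zip(t, keep) if ok]
print([row['k'] for row in t_filtered])  # [1, 2, 4, 5]
```

This cutoff only trims large values; to trim small ones as well, add a lower bound such as np.percentile(a, 10).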

Of course, how you decide which values are outliers is up to you.

Upvotes: 3
