Aaron

Reputation: 2305

Better way to remove statistical outliers than this?

This code works. But I can't help but feel it's a hack, especially the "offset" part. I had to put that in there because otherwise all the index values in deletes are shifted by one every time I do a del operation.

    # remove outliers > devs # of std deviations
    devs = 1
    deletes = []
    for num, duration in enumerate(durations):
        if (duration > (mean_duration + (devs * std_dev_one_test))) or \
            (duration < (mean_duration - (devs * std_dev_one_test))):
            deletes.append(num)
    offset = 0
    for delete in deletes:
        del durations[delete - offset]
        del dates[delete - offset]
        offset += 1

Ideas on how to make it better?

Upvotes: 1

Views: 8306

Answers (4)

Levon

Reputation: 143017

Is the problem that you are deleting items from a list and it causes the index to shift and you are compensating with an offset?

If that's the case, then just delete from the back to the front; that way, deleting an item won't shift the indices of the items you still have to delete.

So start iterating from the last item to the front of the list.
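
A minimal sketch of that back-to-front deletion, assuming `deletes` holds the outlier indices in ascending order, as the loop in the question produces them. The sample values are made up for illustration:

```python
# Hypothetical sample data; in the question these lists already exist.
durations = [10.0, 11.0, 50.0, 9.0, 10.5, 0.1]
dates = ["d0", "d1", "d2", "d3", "d4", "d5"]
deletes = [2, 5]  # indices flagged as outliers, in ascending order

# Walk the indices from highest to lowest, so each deletion only shifts
# elements that sit after it and have already been handled.
for index in reversed(deletes):
    del durations[index]
    del dates[index]

print(durations)
print(dates)
```

No offset bookkeeping is needed, because the positions of the remaining indices in `deletes` never change.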

These SO questions might be of interest: "Delete many elements of list (python)" and "Python: Removing list element while iterating over list".

Another good SO discussion can be found here: Remove items from a list while iterating (thanks to @PaulMcGuire for the suggestion via the comments)

Upvotes: 1

Russell Borogove

Reputation: 19037

Build a list of keepers as you iterate over the list:

def isKeeper( duration ):
    if (duration > (mean_duration + (devs * std_dev_one_test))) or \
            (duration < (mean_duration - (devs * std_dev_one_test))):
        return False
    return True

durations = [duration for duration in durations if isKeeper(duration)]
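
The question also keeps a parallel `dates` list, which the comprehension above leaves untouched. One way to filter both lists together is to pair them with `zip`. This is only a sketch: the sample data is invented, `is_keeper` is a renamed stand-in for the function above, and using `statistics.pstdev` is an assumption, since the question does not show how `std_dev_one_test` was computed:

```python
import statistics

# Hypothetical sample data standing in for the question's lists.
durations = [10.0, 11.0, 50.0, 9.0, 10.5, 12.0]
dates = ["d0", "d1", "d2", "d3", "d4", "d5"]

devs = 1
mean_duration = statistics.mean(durations)
# Assumption: population standard deviation.
std_dev_one_test = statistics.pstdev(durations)

def is_keeper(duration):
    # True when the value lies within devs standard deviations of the mean.
    return abs(duration - mean_duration) <= devs * std_dev_one_test

# Filter the (duration, date) pairs once, then split them back into two lists.
kept = [(duration, date) for duration, date in zip(durations, dates)
        if is_keeper(duration)]
durations = [duration for duration, _ in kept]
dates = [date for _, date in kept]

print(durations)
print(dates)
```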

Upvotes: 4

Akavall

Reputation: 86128

Maybe something like this:

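import numpy as np

myList = [1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5, 99]

mean_duration = np.mean(myList)
std_dev_one_test = np.std(myList)

def drop_outliers(x):
    # Return a boolean, so that a legitimate falsy value such as 0 is not dropped
    return abs(x - mean_duration) <= std_dev_one_test

# list() is needed on Python 3, where filter returns an iterator
myList = list(filter(drop_outliers, myList))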

Result:

>>> myList
[1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5]
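
If the data is already going through NumPy, a boolean mask is another option, and the same mask can be applied to a parallel array. This is a sketch under assumptions: the `dates` labels are invented placeholders, and `devs` is borrowed from the question:

```python
import numpy as np

durations = np.array([1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5, 99])
# Hypothetical parallel labels standing in for the question's dates list.
dates = np.array(["d%d" % i for i in range(len(durations))])

devs = 1
mean_duration = durations.mean()
std_dev_one_test = durations.std()  # population std (ddof=0), as with np.std's default

# True for every element within devs standard deviations of the mean.
keep = np.abs(durations - mean_duration) <= devs * std_dev_one_test

# Index both arrays with the same mask so they stay aligned.
durations = durations[keep]
dates = dates[keep]

print(durations.tolist())
print(dates.tolist())
```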

Upvotes: 3

Michael Anderson

Reputation: 73450

If your data set is small you can just reverse your logic, and keep values instead of deleting them:

# keep values within devs standard deviations of the mean
devs = 1
keeps = []
for duration in durations:
    if (duration <= (mean_duration + (devs * std_dev_one_test))) and \
        (duration >= (mean_duration - (devs * std_dev_one_test))):
        keeps.append(duration)

Upvotes: 0
