Reputation: 2305
This code works. But I can't help but feel it's a hack, especially the "offset" part. I had to put that in there because otherwise all the index values in deletes are shifted by one every time I do a del operation.
# remove outliers > devs # of std deviations
devs = 1
deletes = []
for num, duration in enumerate(durations):
    if (duration > (mean_duration + (devs * std_dev_one_test))) or \
       (duration < (mean_duration - (devs * std_dev_one_test))):
        deletes.append(num)
offset = 0
for delete in deletes:
    del durations[delete - offset]
    del dates[delete - offset]
    offset += 1
Ideas on how to make it better?
Upvotes: 1
Views: 8306
Reputation: 143017
Is the problem that you are deleting items from a list and it causes the index to shift and you are compensating with an offset?
If that's the case, then just delete from the back to the front. That way, each deletion only shifts items you have already processed, so the remaining indices stay valid.
So start iterating from the last item to the front of the list.
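A minimal sketch of the back-to-front idea, reusing the variable names from the question. The sample values for durations, dates, mean_duration and std_dev_one_test are made up for illustration; in the real code they come from your own data.

```python
# Hypothetical sample data; the real lists come from the question's code.
durations = [10.0, 11.0, 50.0, 9.0, 10.5, 0.5]
dates = ["d0", "d1", "d2", "d3", "d4", "d5"]
mean_duration = 10.0     # assumed value, not computed here
std_dev_one_test = 2.0   # assumed value, not computed here
devs = 1

low = mean_duration - devs * std_dev_one_test
high = mean_duration + devs * std_dev_one_test

# Walk the indices from last to first, so deleting an item never
# moves an item that has not been visited yet. No offset is needed.
for i in range(len(durations) - 1, -1, -1):
    if durations[i] < low or durations[i] > high:
        del durations[i]
        del dates[i]

print(durations)
print(dates)
```

Because both lists are deleted at the same index in the same pass, durations and dates stay aligned.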
These SO questions might be of interest: Delete many elements of list (python) and Python: Removing list element while iterating over list
Another good SO discussion can be found here: Remove items from a list while iterating (thanks to @PaulMcGuire for the suggestion via the comments)
Upvotes: 1
Reputation: 19037
Build a list of keepers as you iterate over the list:
def isKeeper( duration ):
    if (duration > (mean_duration + (devs * std_dev_one_test))) or \
       (duration < (mean_duration - (devs * std_dev_one_test))):
        return False
    return True

durations = [duration for duration in durations if isKeeper(duration)]
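The question also keeps a parallel dates list, which the comprehension above does not touch. One possible way to filter both lists together is to pair them with zip. This is only a sketch: filter_outliers is a hypothetical helper name, and the sample values are invented for illustration.

```python
def filter_outliers(durations, dates, mean_duration, std_dev_one_test, devs=1):
    """Return new (durations, dates) lists with the outlier entries left out."""
    low = mean_duration - devs * std_dev_one_test
    high = mean_duration + devs * std_dev_one_test
    # Pair each duration with its date so the two stay aligned while filtering.
    kept = [(dur, date) for dur, date in zip(durations, dates)
            if low <= dur <= high]
    kept_durations = [dur for dur, _ in kept]
    kept_dates = [date for _, date in kept]
    return kept_durations, kept_dates


# Hypothetical sample data and statistics.
durations = [10.0, 11.0, 50.0, 9.0, 10.5, 0.5]
dates = ["d0", "d1", "d2", "d3", "d4", "d5"]
durations, dates = filter_outliers(durations, dates,
                                   mean_duration=10.0, std_dev_one_test=2.0)
print(durations)
print(dates)
```

This builds new lists instead of mutating the originals, so no index bookkeeping is involved at all.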
Upvotes: 4
Reputation: 86128
Maybe something like this:
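import numpy as np

myList = [1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5, 99]
mean_duration = np.mean(myList)
std_dev_one_test = np.std(myList)

def within_one_std(x):
    # Return a bool rather than x itself, so a value of 0 is not dropped by filter()
    return abs(x - mean_duration) <= std_dev_one_test

# list() is needed on Python 3, where filter() returns an iterator
myList = list(filter(within_one_std, myList))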
Result:
>>> myList
[1, 2, 3, 4, 5, 6, 7, 3, 4, 5, 3, 5]
Upvotes: 3
Reputation: 73450
If your data set is small you can just reverse your logic, and keep values instead of deleting them:
# keep values within devs # of std deviations of the mean
devs = 1
keeps = []
for duration in durations:
    if (duration <= (mean_duration + (devs * std_dev_one_test))) and \
       (duration >= (mean_duration - (devs * std_dev_one_test))):
        keeps.append(duration)
Upvotes: 0