Reputation: 5914
I am trying to write a function that takes an array of values and returns it without the outliers, meaning the values that fall more than 1.5 times the interquartile range below the first quartile or above the third quartile. I think the conditions I have in place test for the right thing, but I'm not sure how to pass a column of values from the data frame so that the rows matching the outlier cases are removed. Currently my code fails with a TypeError.
Error: TypeError: tuple indices must be integers, not str
Function:
def reject_outliers_iqr(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((data > upper_bound) > (data < lower_bound))
Dataframe:
rawData = pd.read_csv(parent_folder + "/" + csv_file)
print(rawData.head())
     date day_of_week  leads  clicks  sal
0  1/1/17      Sunday      0     527    0
1  1/2/17      Monday      0    1013    0
2  1/3/17     Tuesday      0    1428    0
3  1/4/17   Wednesday      0    1461    0
4  1/5/17    Thursday      0    1429    0
Upvotes: 1
Views: 1973
Reputation: 76297
Your function's last line seems to me to contain at least three errors, and should probably be:
    return np.where((data > lower_bound) & (data < upper_bound))
Data should be higher than the lower bound, and lower than the upper bound.
The logical conjunction is &, not >.
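To make the two fixes concrete, here is a minimal sketch of the same bounds check written to return a boolean mask rather than positions. The name iqr_inlier_mask and the 50000 value are made up for illustration; the other numbers are the clicks values from the question.

```python
import numpy as np

def iqr_inlier_mask(data):
    """Return a boolean mask that is True for values inside the 1.5 * IQR fences."""
    values = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # & combines the two comparisons element-wise; each side needs its own parentheses
    return (values > lower_bound) & (values < upper_bound)

# clicks values from the question plus one made-up extreme value (50000)
clicks = np.array([527, 1013, 1428, 1461, 1429, 50000])
mask = iqr_inlier_mask(clicks)
print(mask)
print(clicks[mask])
```

Note that adding a single extreme value shifts the quartiles, so which of the remaining values count as inliers can change as well.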
Once corrected, it ran fine for me (using your data):
>>> df.iloc[reject_outliers_iqr(df.clicks)]
     date day_of_week  leads  clicks  sal
0  1/1/17      Sunday      0     527    0
1  1/2/17      Monday      0    1013    0
2  1/3/17     Tuesday      0    1428    0
3  1/4/17   Wednesday      0    1461    0
4  1/5/17    Thursday      0    1429    0
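As for the TypeError itself, I can't see how the function was called, so this is a guess: np.where with a single argument returns a tuple of position arrays, and indexing that tuple with a column name would raise this kind of error. The sketch below rebuilds the five sample rows from the question (the variable names df, inside, positions and filtered are mine) and shows both the tuple and an alternative that filters with a boolean mask directly.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the rawData.head() output in the question
df = pd.DataFrame({
    "date": ["1/1/17", "1/2/17", "1/3/17", "1/4/17", "1/5/17"],
    "day_of_week": ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday"],
    "leads": [0, 0, 0, 0, 0],
    "clicks": [527, 1013, 1428, 1461, 1429],
    "sal": [0, 0, 0, 0, 0],
})

q1, q3 = np.percentile(df["clicks"], [25, 75])
iqr = q3 - q1
# Boolean Series: True where clicks lies inside the 1.5 * IQR fences
inside = (df["clicks"] > q1 - 1.5 * iqr) & (df["clicks"] < q3 + 1.5 * iqr)

# np.where with one argument returns a tuple holding one array of row positions,
# so something like positions["clicks"] would fail with a TypeError
positions = np.where(inside.to_numpy())
print(type(positions))

# Boolean indexing keeps the matching rows without going through np.where
filtered = df[inside]
print(filtered)
```

With only these five rows, none of the clicks values falls outside the fences, so filtered contains all of them, which matches the output above.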
Upvotes: 1