cphill

Reputation: 5914

Numpy Pandas Remove Outliers

I am trying to write a function that takes an array of values and returns it without the outliers, where an outlier is any value more than 1.5 times the interquartile range below the first quartile or above the third quartile. I feel like the conditions I have in place test for the right thing, but I'm not sure how to pass a column of values from the data frame so that the rows matching the outlier cases are removed. Currently my code fails with a TypeError.

Error: TypeError: tuple indices must be integers, not str

Function:

def reject_outliers_iqr(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1

    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((data > upper_bound) > (data < lower_bound))

Dataframe:

rawData = pd.read_csv(parent_folder + "/" + csv_file)
print(rawData.head())

     date day_of_week  leads  clicks  sal
0  1/1/17      Sunday      0     527    0
1  1/2/17      Monday      0    1013    0
2  1/3/17     Tuesday      0    1428    0
3  1/4/17   Wednesday      0    1461    0
4  1/5/17    Thursday      0    1429    0

Upvotes: 1

Views: 1973

Answers (1)

Ami Tavory

Reputation: 76297

Your function's last line seems to me to contain at least three errors, and should probably be

    return np.where((data > lower_bound) & (data < upper_bound))

  1. To keep the non-outliers, data should be higher than the lower bound and lower than the upper bound. Your version reverses both comparisons, which accounts for two of the errors.

  2. The elementwise logical conjunction is &, not >.
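For reference, here is a minimal sketch of the whole function with that last line corrected. The sample array is invented for illustration (it is not from your CSV), with 100 planted as an obvious outlier. Note that np.where called with a single argument returns a tuple of index arrays rather than the filtered values.

```python
import numpy as np


def reject_outliers_iqr(data):
    """Return the positions of the values that lie inside the IQR fences."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1

    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    # Keep values strictly between the fences; & combines the two masks elementwise.
    return np.where((data > lower_bound) & (data < upper_bound))


# Hypothetical sample, not taken from the question's data.
sample = np.array([10, 12, 11, 13, 12, 11, 100])
kept_positions = reject_outliers_iqr(sample)  # a 1-tuple of index arrays
print(sample[kept_positions])
```

Because the result is a tuple of positions, it is meant to be used as an index into the original array or frame, not as the cleaned data itself.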

Once corrected, it ran fine for me (using your data):

>>> df.iloc[reject_outliers_iqr(df.clicks)]
     date day_of_week  leads  clicks  sal
0  1/1/17      Sunday      0     527    0
1  1/2/17      Monday      0    1013    0
2  1/3/17     Tuesday      0    1428    0
3  1/4/17   Wednesday      0    1461    0
4  1/5/17    Thursday      0    1429    0
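If the goal is just to drop the outlier rows, a boolean mask may be simpler than positional indices. The sketch below is one way to do it, not necessarily what your calling code needs: iqr_mask is a hypothetical helper name, and the data frame is invented for illustration with 9000 planted as the outlier. Since your calling code isn't shown, I can only guess that the TypeError comes from indexing the tuple that np.where returns with a column name; a mask avoids that tuple entirely.

```python
import numpy as np
import pandas as pd


def iqr_mask(values, k=1.5):
    """Return a boolean mask that is True for values inside the IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values > q1 - k * iqr) & (values < q3 + k * iqr)


# Invented data for illustration; 9000 is planted as the outlier.
df = pd.DataFrame({
    "day_of_week": ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
    "clicks": [1400, 1420, 1410, 1430, 1415, 9000],
})

# Indexing the frame with one column's mask drops the whole row.
cleaned = df[iqr_mask(df["clicks"])]
print(cleaned)
```

The mask is computed from a single column but applied to the whole frame, so the other columns of each rejected row go with it.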

Upvotes: 1
