Reputation: 119
I have a pandas DataFrame with many columns (>100). I standardized every column so that each one has mean 0 and standard deviation 1. I want to remove every row that has a value below -2 or above 2 in any column. For example, if rows 2, 3, 4 are outliers in the first column and rows 3, 4, 5, 6 are outliers in the second column, I would like to remove rows [2, 3, 4, 5, 6].
What I am trying to do is loop over every column, collect the row indices that are outliers, and store them in a list. At the end I have a list of lists holding the outlier row indices for each column. I take the unique values to get the row indices I should remove. My problem is that I don't know how to slice the data frame so that it no longer contains these rows. I was thinking of using something like an %in% operator, but it doesn't accept a list nested inside a list. My code is below.
### Getting rid of the outliers
'''
We are going to get rid of the outliers that fall outside the range -2 to 2.
'''
aux_features = features_scaled.values
n_cols = aux_features.shape[1]
n_rows = aux_features.shape[0]
outliers_index = []
for i in range(n_cols):
    variable = aux_features[:,i] # We take one column at a time
    condition = (variable < -2) | (variable > 2) # We establish the condition for the outliers
    index = np.where(condition)
    outliers_index.append(index)
outliers = [j for i in outliers_index for j in i]
outliers_2 = np.array([j for i in outliers for j in i])
unique_index = list(np.unique(outliers_2)) # This is the final list with all the index that contain outliers.
total_index = list(range(n_rows))
aux = (total_index in unique_index)
outliers_2 contains all the outlier row indices (with repetitions), and unique_index keeps only the unique values, so I end up with every row index that has at least one outlier. I am stuck at this point. Does anyone know how to complete it, or have a better idea of how to get rid of these outliers? (I guess my method would be very time-consuming for really large datasets.)
Upvotes: 0
Views: 2075
Reputation: 173
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.standard_normal(size=(1000, 5))) # example data
cleaned = df[~(np.abs(df) > 2).any(axis=1)] # axis must be passed by keyword in pandas 2.0+
Explanation:
Build a boolean DataFrame that is True wherever a value is above 2 or below -2:
np.abs(df) > 2
Check whether each row contains an outlier. This evaluates to True for every row with at least one outlier:
(np.abs(df) > 2).any(axis=1)
Finally, select all rows without outliers using the ~ operator:
df[~(np.abs(df) > 2).any(axis=1)]
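To make the steps easier to verify by eye, here is a minimal sketch on a tiny hand-made frame. The column names and values are made up for illustration and are not from the question. It also shows one possible way to finish the index-based approach from the question, using np.where on the whole 2-D array and DataFrame.drop, assuming the goal is simply to drop the collected row positions.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame so the result can be checked by hand.
df = pd.DataFrame({
    "a": [0.1, 2.5, -0.3, 0.0, 1.9],
    "b": [0.2, 0.0, -2.1, 0.4, -1.0],
})

# Mask route: True for rows where any column lies outside [-2, 2].
has_outlier = (df.abs() > 2).any(axis=1)
cleaned = df[~has_outlier]

# Index route, closer to the loop in the question:
# np.where on a 2-D condition returns (row positions, column positions).
rows, _ = np.where((df.values < -2) | (df.values > 2))
unique_rows = np.unique(rows)

# Translate positions to index labels, then drop those rows.
cleaned_by_index = df.drop(df.index[unique_rows])

print(cleaned.index.tolist())
print(cleaned_by_index.equals(cleaned))
```

Both routes keep the same rows here. The mask route avoids the explicit Python loop over columns, which is likely to matter more as the number of columns grows.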
Upvotes: 1