Anflores

Reputation: 119

Getting rid of outlier rows across multiple columns in a pandas DataFrame

I have a pandas DataFrame with many columns (>100). I standardized all the column values so that every column is centered at 0 (mean 0 and std 1). I want to get rid of every row that has a value below -2 or above 2 in any column. For example, say rows 2, 3, 4 are outliers in the first column and rows 3, 4, 5, 6 are outliers in the second column. Then I would like to get rid of rows [2, 3, 4, 5, 6].

What I am trying to do is loop over every column, collect the row indices that are outliers, and store them in a list. At the end I have a list of lists holding the outlier row indices of each column. I take the unique values to obtain the row indices I should get rid of. My problem is that I don't know how to slice the DataFrame so that it doesn't contain these rows. I was thinking of using something like an %in% operator, but it doesn't accept a list nested in a list. My code is below.

### Getting rid of the outliers
'''
We are going to get rid of the outliers who are outside the range of -2 to 2. 
'''                                          
aux_features = features_scaled.values
n_cols = aux_features.shape[1]
n_rows = aux_features.shape[0]
outliers_index = []

for i in range(n_cols):
    variable = aux_features[:,i] # We take one column at a time
    condition = (variable < -2) | (variable > 2) # We establish the condition for the outliers
    index = np.where(condition)
    outliers_index.append(index)

outliers = [j for i in outliers_index for j in i]

outliers_2 = np.array([j for i in outliers for j in i])
unique_index = list(np.unique(outliers_2)) # This is the final list with all the row indices that contain outliers.

total_index = list(range(n_rows))

aux = (total_index in unique_index)

outliers_2 contains all the outlier row indices (with repetitions), and unique_index keeps only the unique values, so I end up with every row index that has an outlier. I am stuck at this point. Does anyone know how to complete it, or have a better idea for getting rid of these outliers? (I suspect my method would be very slow for really large datasets.)

Upvotes: 0

Views: 2075

Answers (1)

chuni0r

Reputation: 173

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.standard_normal(size=(1000, 5)))  # example data
cleaned = df[~(np.abs(df) > 2).any(axis=1)]  # keep rows with no value outside [-2, 2]

Explanation:

Flag every value below -2 or above 2, i.e. whose absolute value exceeds 2. This returns a DataFrame of booleans with the same shape as df:

np.abs(df) > 2

Check whether each row contains an outlier. This evaluates to True for every row where at least one outlier exists:

(np.abs(df) > 2).any(axis=1)

Finally, select all rows without an outlier by negating the mask with the ~ operator:

df[~(np.abs(df) > 2).any(axis=1)]
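
To make the steps above concrete, here is a minimal sketch on a tiny hand-made frame. The column names and values are hypothetical and chosen only so that the result is easy to check by eye. It also shows one possible way to finish the index-list idea from the question, using df.drop on the collected row labels; this is an illustration under those assumptions, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 3 each contain a value outside [-2, 2].
df = pd.DataFrame({
    "a": [0.1, 2.5, -0.3, 0.0],
    "b": [1.0, 0.2, -1.9, -2.1],
})

# Boolean mask: True for every row with at least one outlier.
has_outlier = (np.abs(df) > 2).any(axis=1)

# Option 1: keep only the rows where the mask is False.
cleaned = df[~has_outlier]

# Option 2: collect the offending row labels and drop them,
# which mirrors the index-list approach described in the question.
outlier_rows = df.index[has_outlier]
cleaned_by_drop = df.drop(index=outlier_rows)
```

Both options keep the same rows. The boolean mask avoids building an intermediate list of indices, which may matter for large frames.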

Upvotes: 1
