Reputation: 119
I want to eliminate the outliers in a dataframe that has columns with different dtypes (int64 and object). I need to remove all rows that have outliers in at least one column. So, I tried to use the following code:
import numpy as np
from scipy import stats
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
For each column, this code computes the Z-score for each value by using the column's mean and standard deviation. 'all(axis=1)' guarantees that for each row, all columns satisfy the constraint (absolute value of each z-score is below 3).
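To make that computation concrete, here is a minimal sketch of the z-score of a single numeric column done by hand. The `rent` array is a made-up example, and the use of the population standard deviation (`ddof=0`) is an assumption chosen to match the default of `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical rent values, used only for illustration
rent = np.array([2000.0, 3000.0, 800.0])

# Z-score by hand: subtract the column mean, then divide by the
# population standard deviation (ddof=0, the scipy.stats.zscore default)
z = (rent - rent.mean()) / rent.std(ddof=0)

# True where the value lies within 3 standard deviations of the mean
within_3_sigma = np.abs(z) < 3
```

The same arithmetic fails on a column of strings, because the mean and the division are not defined for them.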
However, as some columns' dtype is 'object', I am receiving the following error: TypeError: unsupported operand type(s) for /: 'str' and 'int'
I think this happens because a z-score cannot be calculated for columns that contain only strings ('object' dtype). So, I need code that considers only the numerical columns when detecting and eliminating the outliers.
Do you know how to eliminate outliers in a dataframe that has columns with different dtypes (int64 and object)?
This dataframe is about property rentals in Brazil. You can create a sample by using the following code:
import pandas as pd
data = {
    'city': ['São Paulo', 'Rio', 'Recife'],
    'area(m2)': [90, 120, 60],
    'Rooms': [3, 2, 4],
    'Bathrooms': [2, 3, 3],
    'animal': ['accept', 'do not accept', 'accept'],
    'rent($)': [2000, 3000, 800]
}
df = pd.DataFrame(
    data,
    columns=['city', 'area(m2)', 'Rooms', 'Bathrooms', 'animal', 'rent($)']
)
print(df)
This is how the sample looks:
        city  area(m2)  Rooms  Bathrooms         animal  rent($)
0  São Paulo        90      3          2         accept     2000
1        Rio       120      2          3  do not accept     3000
2     Recife        60      4          3         accept      800
The original dataset can be found at: https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent
Upvotes: 1
Views: 2781
Reputation: 35696
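Try using select_dtypes to get all columns of df that have a particular dtype.
To select all numeric dtypes, pass include=np.number or include='number'. The z-scores are then computed on the numeric columns only, while the resulting boolean mask is applied to the full df, so the 'object' columns are kept in the result: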
new_df = df[
(np.abs(stats.zscore(df.select_dtypes(include=np.number))) < 3).all(axis=1)
]
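As a self-contained sketch of the same idea: the frame below is hypothetical (it is not the Kaggle dataset, and the 100000 rent is an invented extreme value). A larger frame is used because with only three rows, as in the question's sample, no z-score can reach 3, so nothing would be filtered. Wrapping the result in np.asarray is my own precaution so that the mask is a plain NumPy array.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 11 ordinary listings plus one extreme rent (last row)
df = pd.DataFrame({
    "city": ["São Paulo", "Rio", "Recife"] * 4,
    "area(m2)": [90, 120, 60, 80, 100, 70, 110, 95, 85, 75, 105, 65],
    "rent($)": [2000, 3000, 800, 1500, 2500, 1200,
                2800, 2100, 1900, 1400, 2600, 100000],
})

# Z-scores are computed per column, on the numeric columns only
numeric = df.select_dtypes(include="number")
z = np.abs(np.asarray(stats.zscore(numeric, axis=0)))

# Keep a row only if every numeric column is within 3 standard deviations
mask = (z < 3).all(axis=1)
new_df = df[mask]
```

Here only the last row is dropped, and new_df still contains the 'city' column because the mask is applied to df rather than to the numeric subset.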
Upvotes: 4
Reputation: 369
You can iterate through the columns, check the dtype of each one, and only look for outliers in columns of the type you want. Keep a running list of the row indexes to drop. Something like this:
drop_idx = []
for col in df:
    # skip non-numeric columns (e.g. 'object' columns holding strings)
    if not pd.api.types.is_numeric_dtype(df[col]):
        continue
    # grab indexes of all outliers; notice that it's '>= 3' now
    drop_idx.extend(df[np.abs(stats.zscore(df[col])) >= 3].index)
df = df.drop(index=list(set(drop_idx)))
Upvotes: 0