Replace outliers in a mixed dataframe with pandas

Question

I have a mixed dataframe with str, int, and float columns. Some of the float columns contain outliers, and I tried to replace them with NaN using

df.mask(df.sub(df.mean()).div(df.std()).abs().gt(2))

I've also tried with numpy's

v = df.values
mask = np.abs((v - v.mean(0)) / v.std(0)) > 2
pd.DataFrame(np.where(mask, np.nan, v), df.index, df.columns)

But both attempts fail, with TypeError: unsupported operand type(s) for -: 'str' and 'float' and TypeError: must be str, not float

I've also tried applying this only to the column with the outliers, but it doesn't modify anything.

This is what the df looks like:

   dateRep     cases  deaths  countriesAndTerritories  countryterritoryCode  popData2018
0  03/05/2020  134.0  4.0     Afghanistan              AFG                   37172386.0
1  02/05/2020  164.0  4.0     Afghanistan              AFG                   37172386.0
2  01/05/2020  222.0  NaN     Afghanistan              AFG                   37172386.0
3  30/04/2020  122.0  0.0     Afghanistan              AFG                   37172386.0
4  29/04/2020  124.0  3.0     Afghanistan              AFG                   37172386.0

Ashwin · Accepted Answer

You could try something like this (this is to change the "cases" column):

df.loc[abs(df.cases - df.cases.mean())/df.cases.std() > 1, "cases"] = None

Note, however, that I have used a Z threshold of 1 for the "cases" column here, because the largest Z value in your sample is about 1.64 (the row with index 2). You are trying to replace values whose Z value is greater than 2, but none of the rows shown has one, which would explain why applying it to that column alone changed nothing.
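If you want the same idea applied to every numeric column at once while leaving the string columns alone, something along these lines might work. This is only a sketch: the helper name mask_outliers and its threshold parameter are made up for illustration, the frame is rebuilt by hand from the five rows in the question, and it assumes the sample standard deviation (the pandas default for std()) is the one you want.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "dateRep": ["03/05/2020", "02/05/2020", "01/05/2020", "30/04/2020", "29/04/2020"],
    "cases": [134.0, 164.0, 222.0, 122.0, 124.0],
    "deaths": [4.0, 4.0, np.nan, 0.0, 3.0],
    "countriesAndTerritories": ["Afghanistan"] * 5,
    "countryterritoryCode": ["AFG"] * 5,
    "popData2018": [37172386.0] * 5,
})


def mask_outliers(frame, threshold=2.0):
    """Hypothetical helper: set numeric values with |Z| > threshold to NaN."""
    # Only numeric columns take part, so str columns never reach the arithmetic.
    numeric_cols = frame.select_dtypes(include="number").columns
    numeric = frame[numeric_cols]
    # Column-wise Z scores; std() uses ddof=1 by default in pandas.
    z = (numeric - numeric.mean()) / numeric.std()
    out = frame.copy()
    # Comparisons against NaN are False, so constant columns (std of 0)
    # and existing NaN cells are left as they are.
    out[numeric_cols] = numeric.mask(z.abs() > threshold)
    return out


# Z scores for the "cases" column, to see which threshold makes sense.
cases_z = (df["cases"] - df["cases"].mean()) / df["cases"].std()
print(cases_z.round(2))

cleaned = mask_outliers(df, threshold=1.0)
print(cleaned)
```

mask returns a new object rather than changing df in place, so the result has to be assigned back (here to cleaned) for the change to be visible.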

Hope this helps!
