Nicole Goebel

Reputation: 75

Remove outliers (+/- 3 std) and replace with np.nan in Python/pandas

I have seen several solutions that come close to solving my problem

link1 link2

but they have not helped me succeed thus far.

I believe that the following solution is what I need, but I continue to get an error (and I don't have the reputation points to comment or ask about it there): link

I get the following warning when running df2 = df.groupby('install_site').transform(replace), and I don't understand where to add a .copy() or an "inplace=True":

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: link

So I have attempted to come up with my own version, but I keep getting stuck. Here goes.

I have a data frame indexed by time with columns for site (string values for many different sites) and float values.

time_index            site       val

I would like to go through the 'val' column, grouped by site, and replace any outliers (those +/- 3 standard deviations from the mean) with a NaN (for each group).

When I use the following function, I cannot index the data frame with my vector of True/Falses:

def replace_outliers_with_nan(df, stdvs):
    dfnew=pd.DataFrame()
    for i, col in enumerate(df.sites.unique()):
        dftmp = pd.DataFrame(df[df.sites==col])
        idx = [np.abs(dftmp-dftmp.mean())<=(stdvs*dftmp.std())] #boolean vector of T/F's
        dftmp[idx==False]=np.nan  #this is where the problem lies, I believe
        dfnew[col] = dftmp
    return dfnew

In addition, I fear the above function will take a very long time on 7 million+ rows, which is why I was hoping to use the groupby function option.

Upvotes: 2

Views: 12206

Answers (1)

RickardSjogren

Reputation: 4238

If I have understood you correctly, there is no need to iterate over the columns. This solution replaces every value that deviates from its group mean by more than three group standard deviations with NaN.

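import numpy as np

def replace(group, stds):
    # Set values more than `stds` group standard deviations from the group mean to NaN
    group[np.abs(group - group.mean()) > stds * group.std()] = np.nan
    return group

# df is your DataFrame; group_column is the name of the column to group by (e.g. 'site')
df.loc[:, df.columns != group_column] = df.groupby(group_column).transform(lambda g: replace(g, 3))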
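
If assigning into the group in place triggers the SettingWithCopyWarning from the question, one possible variant is to compute the group mean and standard deviation as row-aligned Series and mask the values on a copy. This is only a sketch: the function name mask_outliers, the column names 'site' and 'val', and the sample data are assumptions made for illustration, not part of the original question or answer. It relies on built-in aggregations instead of a Python lambda, which may also help on large frames, though I have not timed it on 7 million rows.

```python
import numpy as np
import pandas as pd


def mask_outliers(df, group_col, value_col, n_std=3.0):
    """Return a copy of df with per-group outliers in value_col set to NaN."""
    grouped = df.groupby(group_col)[value_col]
    mean = grouped.transform("mean")  # group mean, aligned to each row
    std = grouped.transform("std")    # group sample std (ddof=1), aligned to each row
    is_outlier = (df[value_col] - mean).abs() > n_std * std
    out = df.copy()                   # leave the caller's frame untouched
    out[value_col] = out[value_col].mask(is_outlier)
    return out


# Hypothetical data: site "a" has 20 ordinary readings plus one spike,
# site "b" has 21 evenly spaced readings and no outliers.
df = pd.DataFrame(
    {
        "site": ["a"] * 21 + ["b"] * 21,
        "val": [1.0, 1.2] * 10 + [100.0] + list(np.linspace(5.0, 6.0, 21)),
    },
    index=pd.date_range("2020-01-01", periods=42, freq="D"),
)

cleaned = mask_outliers(df, "site", "val")
print(cleaned[cleaned["val"].isna()])
```

Note that a group with a single row has an undefined (NaN) standard deviation, so the comparison is False and that row is kept as-is.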

Upvotes: 6
