Remove outliers before aggregate in Python Pandas

Question

I have a DataFrame with 5 columns. I group by the first 4 columns and calculate the mean, standard deviation, and count of the 5th column.

I do this with the following code:

df.groupby(['col1','col2','col3','col4']).agg([np.mean, np.std, len])

Now my question: I have a function that replaces the outliers with the mean value. How can I drop the rows that are outliers instead?

def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        
    return group

df.groupby(['col1','col2','col3','col4']).transform(replace)

Second question: when I try to combine transform and agg, I get the following error:

df.groupby(['col1','col2','col3','col4']).transform(replace).agg([np.mean, np.std, len])

AttributeError: 'DataFrame' object has no attribute 'agg'

HYRY · Accepted Answer

transform() returns a DataFrame, which has no agg() method in the pandas version used here (hence the AttributeError), so you need to call groupby() again. Alternatively, you can save the groupby object and reuse its grouper attribute.
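As a minimal sketch of the "call groupby() again" route, the transformed Series can be regrouped by the original key columns without relying on the grouper attribute. The two-key frame, the seed, and the variable names below are assumptions for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed data, not from the question).
rng = np.random.default_rng(0)
keys = ["c1", "c2"]
df = pd.DataFrame(rng.integers(0, 3, size=(500, 2)), columns=keys)
df["c5"] = rng.standard_normal(500)

def replace(group):
    # Replace values more than 2 standard deviations from the group mean.
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2 * std
    return group.where(inliers, mean)

# transform() returns a Series aligned with df's index.
replaced = df.groupby(keys)["c5"].transform(replace)

# Group the transformed values again by the same key columns, then aggregate.
res = replaced.groupby([df[k] for k in keys]).agg(["mean", "std", "count"])
print(res.head())
```

Because the transformed Series keeps the original index, passing the key columns themselves to groupby() lines the values up with their groups.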

To drop outliers, call apply() to get a boolean Series mask, select the rows with it, and then call groupby() again.

import pandas as pd
import numpy as np

N = 10000
df = pd.DataFrame(np.random.randint(0, 5, size=(N, 4)), columns=["c1", "c2", "c3", "c4"])
df["c5"] = np.random.randn(N)

def replace(group):
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    return group.where(inliers, mean)

def drop(group):
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    return inliers

g = df.groupby(['c1','c2','c3','c4'])

s1 = g.c5.transform(replace)
res1 = s1.groupby(g.grouper).agg([np.mean, np.std, len])

mask = g.c5.apply(drop)
res2 = df[mask].groupby(['c1','c2','c3','c4']).c5.agg([np.mean, np.std, len])
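If the mask returned by apply() does not line up with df's index on your pandas version, one possible variant is to build the mask with transform(), which returns a result indexed like the original frame. This is only a sketch under assumptions: the small two-key frame, the seed, and the helper name is_inlier are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed data, not from the question).
rng = np.random.default_rng(1)
keys = ["c1", "c2"]
df = pd.DataFrame(rng.integers(0, 3, size=(500, 2)), columns=keys)
df["c5"] = rng.standard_normal(500)

def is_inlier(group):
    # True for values within 2 standard deviations of the group mean.
    mean, std = group.mean(), group.std()
    return (group - mean).abs() <= 2 * std

# transform() keeps df's index, so the mask can be used directly on df.
# astype(bool) is a defensive cast in case the dtype is not preserved.
mask = df.groupby(keys)["c5"].transform(is_inlier).astype(bool)

# Drop the outlier rows, then group and aggregate what is left.
filtered = df[mask]
res = filtered.groupby(keys)["c5"].agg(["mean", "std", "count"])
print(res.head())
```

Note that the mean and std used for the outlier test come from the full groups, while the final aggregates are computed only on the rows that were kept.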

You can also calculate the aggregates inside the callback function:

def func(group):
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    tmp = group[inliers]
    return {"mean":tmp.mean(), "std":tmp.std(), "len":tmp.shape[0]}

g.c5.apply(func).unstack()
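For comparison, here is a sketch of the same idea with the callback returning a pd.Series instead of a dict, which makes the statistic labels an explicit index level before unstack(). Returning a Series is my assumption for illustration, as are the two-key frame, the seed, and the name summarize.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed data, not from the question).
rng = np.random.default_rng(2)
keys = ["c1", "c2"]
df = pd.DataFrame(rng.integers(0, 3, size=(500, 2)), columns=keys)
df["c5"] = rng.standard_normal(500)

def summarize(group):
    # Keep values within 2 standard deviations, then summarize them.
    mean, std = group.mean(), group.std()
    kept = group[(group - mean).abs() <= 2 * std]
    return pd.Series({"mean": kept.mean(), "std": kept.std(), "len": len(kept)})

# apply() stacks the per-group Series under the group keys;
# unstack() moves the statistic labels into columns.
res = df.groupby(keys)["c5"].apply(summarize).unstack()
print(res.head())
```

Since the three statistics share one Series per group, the len column comes back as a float; cast it with astype(int) if you need integer counts.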
