Tasos

Reputation: 7587

Remove outliers before aggregate in Python Pandas

I have a DataFrame with 5 columns. I group by the first 4 columns and calculate the mean, standard deviation, and count of the 5th column.

I do this with the following code:

df.groupby(['col1','col2','col3','col4']).agg([np.mean, np.std, len])

Now my question: I have a function that replaces the outliers with the mean value. How can I drop the outlier rows instead?

def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean        
    return group

df.groupby(['col1','col2','col3','col4']).transform(replace)

Second question:

When I try to chain transform and agg, I get the following error:

df.groupby(['col1','col2','col3','col4']).transform(replace).agg([np.mean, np.std, len])

AttributeError: 'DataFrame' object has no attribute 'agg'

Upvotes: 0

Views: 1878

Answers (1)

HYRY

Reputation: 97331

transform() returns a DataFrame, which has no agg() method here, so you need to call groupby() again on the result. Alternatively, you can save the groupby object and reuse its grouper attribute.
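As a minimal sketch of the first option (calling groupby() again), the transformed Series can be regrouped by passing the original key columns. The names keys, rng, and the random test data are assumptions for illustration, and the 2*std threshold simply mirrors the one used in this answer.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed): four integer key columns and one value column.
rng = np.random.default_rng(0)
N = 10_000
keys = ["c1", "c2", "c3", "c4"]
df = pd.DataFrame(rng.integers(0, 5, size=(N, 4)), columns=keys)
df["c5"] = rng.standard_normal(N)

def replace(group):
    """Replace values more than 2 standard deviations from the group mean with that mean."""
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2 * std
    return group.where(inliers, mean)

# transform() returns a Series aligned with df's index, not a groupby object.
s1 = df.groupby(keys)["c5"].transform(replace)

# Group the transformed values again, using the key columns of df as groupers.
res1 = s1.groupby([df[k] for k in keys]).agg(["mean", "std", "count"])
```

Because s1 shares its index with df, the key columns line up row by row, so the second groupby() produces the same groups as the first.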

To drop outliers, call apply() to get a boolean Series mask, use it to select the rows you want to keep, and then call groupby() again on the filtered DataFrame.

import pandas as pd
import numpy as np

N = 10000
df = pd.DataFrame(np.random.randint(0, 5, size=(N, 4)), columns=["c1", "c2", "c3", "c4"])
df["c5"] = np.random.randn(N)

def replace(group):
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    return group.where(inliers, mean)

def drop(group):
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    return inliers

g = df.groupby(['c1','c2','c3','c4'])

s1 = g.c5.transform(replace)
res1 = s1.groupby(g.grouper).agg([np.mean, np.std, len])

mask = g.c5.apply(drop)
res2 = df[mask].groupby(['c1','c2','c3','c4']).c5.agg([np.mean, np.std, len])
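Depending on the pandas version, the result of apply() may carry the group keys in its index, in which case it can no longer be used directly as a row mask for df. A hedged alternative is to build the mask with transform(), which keeps the original index. The helper name is_inlier, its k parameter, and the test data below are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed): four integer key columns and one value column.
rng = np.random.default_rng(0)
N = 10_000
keys = ["c1", "c2", "c3", "c4"]
df = pd.DataFrame(rng.integers(0, 5, size=(N, 4)), columns=keys)
df["c5"] = rng.standard_normal(N)

def is_inlier(group, k=2.0):
    """Return True where a value lies within k standard deviations of its group mean."""
    return (group - group.mean()).abs() <= k * group.std()

# transform() yields one boolean per original row, aligned with df's index.
# Note: in a single-row group std() is NaN, so that row is marked False and dropped.
mask = df.groupby(keys)["c5"].transform(is_inlier).astype(bool)

# Keep only the inlier rows, then aggregate per group as before.
filtered = df[mask]
res2 = filtered.groupby(keys)["c5"].agg(["mean", "std", "count"])
```

The statistics in res2 are computed after the outliers are removed, while the mean and standard deviation used to decide what counts as an outlier come from the unfiltered groups.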

You can also compute the aggregates inside the callback function:

def func(group):
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    tmp = group[inliers]
    return {"mean":tmp.mean(), "std":tmp.std(), "len":tmp.shape[0]}

g.c5.apply(func).unstack()

Upvotes: 2
