Reputation: 7587
I have a DataFrame with 5 columns. I group by the first 4 columns and calculate the mean, std and count of the 5th column.
I do this with the following code:
df.groupby(['col1','col2','col3','col4']).agg([np.mean, np.std, len])
Now my question: I have a function that replaces the outliers with the mean value. How can I drop the rows that are outliers instead?
def replace(group):
    mean, std = group.mean(), group.std()
    outliers = (group - mean).abs() > 3*std
    group[outliers] = mean
    return group
df.groupby(['col1','col2','col3','col4']).transform(replace)
Second question: when I try to combine transform and agg, I get the following error:
df.groupby(['col1','col2','col3','col4']).transform(replace).agg([np.mean, np.std, len])
AttributeError: 'DataFrame' object has no attribute 'agg'
Upvotes: 0
Views: 1878
Reputation: 97331
transform() returns a DataFrame, which has no agg() method here, so you need to call groupby() again on the result. Alternatively, you can save the groupby object and reuse its grouper attribute.
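As a rough sketch of the "call groupby() again" route: transform() keeps the original row index, so the transformed values can be regrouped by the key columns of the original frame. The column name col5, the random data, the helper name replace_outliers and the string aggregation names are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four key columns and one value column (col5 is an assumed name).
rng = np.random.default_rng(0)
keys = ["col1", "col2", "col3", "col4"]
df = pd.DataFrame(rng.integers(0, 2, size=(800, 4)), columns=keys)
df["col5"] = rng.standard_normal(800)

def replace_outliers(group):
    # Replace values more than 2 standard deviations from the group mean with that mean.
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2 * std
    return group.where(inliers, mean)

# transform() returns a Series aligned with df's index ...
replaced = df.groupby(keys)["col5"].transform(replace_outliers)

# ... so it can be grouped again by the original key columns.
stats = replaced.groupby([df[k] for k in keys]).agg(["mean", "std", "count"])
print(stats.head())
```

Grouping by the key columns themselves avoids depending on an attribute of the saved groupby object.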
To drop outliers, call apply() to get a boolean Series mask, use it to select the rows you want to keep, and then call groupby() again on the filtered rows.
import pandas as pd
import numpy as np

N = 10000
df = pd.DataFrame(np.random.randint(0, 5, size=(N, 4)), columns=["c1", "c2", "c3", "c4"])
df["c5"] = np.random.randn(N)

def replace(group):
    # Replace values more than 2 std from the group mean with the mean.
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    return group.where(inliers, mean)

def drop(group):
    # Return a boolean mask that is True for the rows to keep.
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    return inliers

g = df.groupby(['c1','c2','c3','c4'])

# Replace outliers, then aggregate by reusing the saved grouper.
s1 = g.c5.transform(replace)
res1 = s1.groupby(g.grouper).agg([np.mean, np.std, len])

# Drop outliers, then group the remaining rows again.
mask = g.c5.apply(drop)
res2 = df[mask].groupby(['c1','c2','c3','c4']).c5.agg([np.mean, np.std, len])
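Depending on your pandas version, apply() may prepend the group keys to the index of its result, in which case the mask no longer lines up with df. A minimal sketch that sidesteps this by building the mask with transform(), which returns a result indexed like the input. The fixed seed and the string aggregation names are my own choices, not part of the code above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 10000
keys = ["c1", "c2", "c3", "c4"]
df = pd.DataFrame(rng.integers(0, 5, size=(N, 4)), columns=keys)
df["c5"] = rng.standard_normal(N)

g = df.groupby(keys)["c5"]

# Broadcast each group's mean and std back onto the original rows.
group_mean = g.transform("mean")
group_std = g.transform("std")

# True for rows within 2 standard deviations of their group mean.
mask = (df["c5"] - group_mean).abs() <= 2 * group_std

# Keep only the inliers, then aggregate per group.
filtered = df[mask]
res = filtered.groupby(keys)["c5"].agg(["mean", "std", "count"])
print(res.head())
```

Note that a group with a single row has an undefined std, so the comparison is False and that row is dropped.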
You can also compute the aggregates inside the callback function:
def func(group):
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2*std
    tmp = group[inliers]
    return {"mean":tmp.mean(), "std":tmp.std(), "len":tmp.shape[0]}

g.c5.apply(func).unstack()
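If you prefer not to rely on how apply() assembles dict results, the same table can be built with an explicit loop over the groups. This is only a sketch of that idea: the helper name summarize, the smaller data set and the seed are my own assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
N = 2000
keys = ["c1", "c2", "c3", "c4"]
df = pd.DataFrame(rng.integers(0, 3, size=(N, 4)), columns=keys)
df["c5"] = rng.standard_normal(N)

def summarize(group):
    # Drop values more than 2 std from the group mean, then summarize the rest.
    mean, std = group.mean(), group.std()
    inliers = (group - mean).abs() <= 2 * std
    kept = group[inliers]
    return {"mean": kept.mean(), "std": kept.std(), "len": len(kept)}

index, records = [], []
# Iterating a groupby yields (key tuple, group Series) pairs.
for key, group in df.groupby(keys)["c5"]:
    index.append(key)
    records.append(summarize(group))

res3 = pd.DataFrame(records, index=pd.MultiIndex.from_tuples(index, names=keys))
print(res3.head())
```

The loop is slower than a vectorized approach on large data, but it makes each step easy to inspect.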
Upvotes: 2