Reputation: 1391
I have a df that's grouped by id
. For each id
group, I would like to return all the rows where column a
is greater than the mean of a
for that group. I tried the following:
df = pd.DataFrame(data = {"a": np.arange(10), "b": np.arange(10)[::-1], "c": np.random.choice(a = np.arange(10), size = 10)}, index = pd.Index(data = np.random.choice(a = [1,2,3], size = 10), name = "id"))
df.groupby("id").apply(lambda x: x[x.a > x.a.mean()])
This raises a ValueError: Duplicated level name: "id", assigned to level 1, is already used for level 0.
What am I doing wrong?
Upvotes: 4
Views: 3288
Reputation: 863166
Use transform
for same Series
as size of original DataFrame
for better performance like apply
solution:
df = df[df['a'] > df.groupby("id")['a'].transform('mean')]
print (df)
a b c
id
2 5 4 3
2 6 3 2
1 7 2 3
3 8 1 0
3 9 0 5
Detail:
print (df.groupby("id")['a'].transform('mean'))
id
1 4.75
1 4.75
1 4.75
3 3.50
3 3.50
1 4.75
1 4.75
1 4.75
1 4.75
1 4.75
Name: a, dtype: float64
In your solution need parameter group_keys=False
for avoid MultiIndex
with same levels names, because id
in index name:
df = df.groupby("id", group_keys=False).apply(lambda x: x[x.a > x.a.mean()])
If first reset_index()
get column name id
and index name id
, but there are same values:
df = df.reset_index().groupby("id").apply(lambda x: x[x.a > x.a.mean()])
print (df)
id a b c
id
2 6 2 6 3 3
7 2 7 2 9
9 2 9 0 1
3 5 3 5 4 9
8 3 8 1 8
Another test - removing index name
- id
:
df = df.rename_axis(None)
print (df)
a b c
3 0 9 2
2 1 8 2
1 2 7 6
3 3 6 1
1 4 5 3
2 5 4 9
3 6 3 6
2 7 2 1
1 8 1 0
1 9 0 1
df = df.groupby(level=0).apply(lambda x: x[x.a > x.a.mean()])
print (df)
a b c
1 1 8 1 0
1 9 0 1
2 2 5 4 9
2 7 2 1
3 3 6 3 6
Upvotes: 3