Reputation: 514
It's straightforward to keep the last N rows for every group in a dataframe with something like df.groupby('ID').tail(N)
.
In my case, groups have different sizes and I would like to keep the same % of each group rather than same number of rows.
e.g if we want to keep the last 50% rows for each group (based on ID) for the following :
df = pd.DataFrame({'ID' : ['A','A','B','B','B','B','B','B'],
'value' : [1,2,10,11,12,13,14,15]})
The result would be :
pd.DataFrame({'ID' : ['A','A','B','B','B','B','B','B'],
'value' : [2,13,14,15]})
How can we get to that ?
EDIT : If x% is not an int, we round to the smallest closer int.
Upvotes: 6
Views: 1516
Reputation: 41487
groupby
-apply
-tail
Pass the desired size to tail()
in a GroupBy.apply()
. This is simpler than the iloc
method below since it cleanly handles the "last 0 rows" case.
ratio = 0.6
(df.groupby('ID')
.apply(lambda x: x.tail(int(ratio * len(x))))
.reset_index(drop=True))
# ID value
# 0 A 2
# 1 B 13
# 2 B 14
# 3 B 15
ratio = 0.4
(df.groupby('ID')
.apply(lambda x: x.tail(int(ratio * len(x))))
.reset_index(drop=True))
# ID value
# 0 B 14
# 1 B 15
groupby
-apply
-iloc
Alternatively, index the desired size via iloc
/slicing, but this is clunkier since [-0:]
does not actually get the last 0 rows, so we have to check against that:
ratio = 0.6
(df.groupby('ID')
.apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
.reset_index(drop=True))
# ID value
# 0 A 2
# 1 B 13
# 2 B 14
# 3 B 15
ratio = 0.4
(df.groupby('ID')
.apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
.reset_index(drop=True))
# ID value
# 0 B 14
# 1 B 15
Upvotes: 8
Reputation: 323396
Let us try two steps
s = df.groupby('ID').size()
out = df.groupby('ID').apply(lambda x : x.tail(s.loc[x.name]//2)).reset_index(drop=True)
out
Out[210]:
ID value
0 A 2
1 B 13
2 B 14
3 B 15
Upvotes: 5
Reputation: 150825
Like commented, there is no built-in option to do so. You can do something like:
groups = df.groupby('ID')
enums = groups.cumcount().add(1)
sizes = groups['ID'].transform('size')
df[enums/sizes > 0.5]
Output:
ID value
1 A 2
5 B 13
6 B 14
7 B 15
Upvotes: 8