mlx

Reputation: 514

Keep X% last rows by group in Pandas

It's straightforward to keep the last N rows for every group in a dataframe with something like df.groupby('ID').tail(N).

In my case, groups have different sizes and I would like to keep the same % of each group rather than same number of rows.

e.g. if we want to keep the last 50% of rows for each group (based on ID) for the following:

df = pd.DataFrame({'ID' : ['A','A','B','B','B','B','B','B'], 
'value' : [1,2,10,11,12,13,14,15]})

The result would be :

 pd.DataFrame({'ID' : ['A','B','B','B'], 
    'value' : [2,13,14,15]})

How can we achieve that?

EDIT: If x% of the group size is not an int, we round down to the nearest int.

Upvotes: 6

Views: 1516

Answers (3)

tdy

Reputation: 41487

groupby-apply-tail

Pass the desired size to tail() in a GroupBy.apply(). This is simpler than the iloc method below since it cleanly handles the "last 0 rows" case.

ratio = 0.6
(df.groupby('ID')
   .apply(lambda x: x.tail(int(ratio * len(x))))
   .reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15
ratio = 0.4
(df.groupby('ID')
   .apply(lambda x: x.tail(int(ratio * len(x))))
   .reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15

groupby-apply-iloc

Alternatively, index the desired size via iloc/slicing, but this is clunkier since [-0:] does not actually get the last 0 rows (because -0 == 0, it slices the whole frame), so we have to check against that:

ratio = 0.6
(df.groupby('ID')
   .apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
   .reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15
ratio = 0.4
(df.groupby('ID')
   .apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
   .reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15

Upvotes: 8

BENY

Reputation: 323396

Let us try two steps

s = df.groupby('ID').size()
out = df.groupby('ID').apply(lambda x : x.tail(s.loc[x.name]//2)).reset_index(drop=True)
out
Out[210]: 
  ID  value
0  A      2
1  B     13
2  B     14
3  B     15
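The //2 above hardcodes 50%; a sketch of the same two-step idea generalized to an arbitrary ratio (the variable name ratio is my addition, and int() truncates, matching the question's round-down requirement):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'value': [1, 2, 10, 11, 12, 13, 14, 15]})

ratio = 0.5  # fraction of each group to keep (assumption: 0 <= ratio <= 1)
s = df.groupby('ID').size()

# int() rounds down, so e.g. a 5-row group at ratio=0.5 keeps 2 rows
out = (df.groupby('ID')
         .apply(lambda x: x.tail(int(s.loc[x.name] * ratio)))
         .reset_index(drop=True))
print(out)
```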

Upvotes: 5

Quang Hoang

Reputation: 150825

As noted in the comments, there is no built-in option to do so. You can do something like:

groups = df.groupby('ID')

enums = groups.cumcount().add(1)
sizes = groups['ID'].transform('size')

df[enums/sizes > 0.5]

Output:

  ID  value
1  A      2
5  B     13
6  B     14
7  B     15
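The 0.5 threshold can be parameterized for an arbitrary fraction; a sketch of this mask-based approach with a ratio variable (my addition). Note that, unlike the apply-based answers, this keeps the ceiling rather than the floor when ratio * size is not an int, since any row whose relative position exceeds 1 - ratio passes the mask:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
                   'value': [1, 2, 10, 11, 12, 13, 14, 15]})

ratio = 0.5  # fraction of each group to keep
groups = df.groupby('ID')

enums = groups.cumcount().add(1)          # 1-based position within each group
sizes = groups['ID'].transform('size')    # group size broadcast to each row

# keep a row once its relative position exceeds (1 - ratio)
out = df[enums / sizes > 1 - ratio]
print(out)
```

Being fully vectorized (no apply), this scales better on dataframes with many groups.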

Upvotes: 8
