Relax ZeroC
Relax ZeroC

Reputation: 683

[Pandas]how to get top-n% records within each group

This is my dataFrame

df = pd.DataFrame([['@1','A',40],['@2','A',60],['@3','A',47],['@4','B',33],['@5','B',69],['@6','B',22],['@7','B',90],['@8
','C',31],['@9','C',78],['@10','C',12],['@11','C',89],['@12','C',88],['@13','C',99]],columns=['id','channel','score'])

     id channel  score
0    @1       A     40
1    @2       A     60
2    @3       A     47
3    @4       B     33
4    @5       B     69
5    @6       B     22
6    @7       B     90
7    @8       C     31
8    @9       C     78
9   @10       C     12
10  @11       C     89
11  @12       C     88
12  @13       C     99

Each channel has its own total number , I set a percent number = 80%

and I want to take int(channel'num * 0.8) nlargest , so it's will be

A channel take int(3*0.8) = 2
B channel take int(4*0.8) = 3
C channel take int(6*0.8) = 4

     id channel  score
1    @2       A     60
2    @3       A     47
3    @4       B     33
4    @5       B     69
6    @7       B     90
8    @9       C     78
10  @11       C     89
11  @12       C     88
12  @13       C     99

How can I do that , thanks.

Upvotes: 9

Views: 5574

Answers (2)

user1587520
user1587520

Reputation: 4513

There is another solution that does not require the usage of apply: Enumerate the rows in each group using cumcount and devide that by the group size to get the percentile the row belongs to in the group. Based on this you can create a mask to select the rows you want from the DataFrame:

key = 'channel'

# Group position for each row
group_idx = df.groupby(key).cumcount()
# Group size for each row
group_size = df.groupby(key)[key].transform('size')

mask = (group_idx/group_size) < 0.8
group_top_pct = df[mask]

Upvotes: 0

jezrael
jezrael

Reputation: 862511

Use groupby with nlargest:

a = 0.8

df1 = (df.groupby('channel',group_keys=False)
        .apply(lambda x: x.nlargest(int(len(x) * a), 'score')))
print (df1)     
     id channel  score
1    @2       A     60
2    @3       A     47
6    @7       B     90
4    @5       B     69
3    @4       B     33
12  @13       C     99
10  @11       C     89
11  @12       C     88
8    @9       C     78

Another solution with sort_values + groupby + head:

df1 = (df.sort_values('score', ascending=False)
        .groupby('channel',group_keys=False)
        .apply(lambda x: x.head(int(len(x) * a)))
        .reset_index(drop=True))

print (df1)
    id channel  score
0   @2       A     60
1   @3       A     47
2   @7       B     90
3   @5       B     69
4   @4       B     33
5  @13       C     99
6  @11       C     89
7  @12       C     88
8   @9       C     78

Upvotes: 11

Related Questions