Reputation: 683
This is my DataFrame:
import pandas as pd

df = pd.DataFrame([['@1','A',40], ['@2','A',60], ['@3','A',47], ['@4','B',33],
                   ['@5','B',69], ['@6','B',22], ['@7','B',90], ['@8','C',31],
                   ['@9','C',78], ['@10','C',12], ['@11','C',89], ['@12','C',88],
                   ['@13','C',99]], columns=['id','channel','score'])
id channel score
0 @1 A 40
1 @2 A 60
2 @3 A 47
3 @4 B 33
4 @5 B 69
5 @6 B 22
6 @7 B 90
7 @8 C 31
8 @9 C 78
9 @10 C 12
10 @11 C 89
11 @12 C 88
12 @13 C 99
Each channel has its own number of rows. I set a percentage = 80%, and I want to take the int(channel_size * 0.8) largest scores from each channel, so:
channel A takes int(3 * 0.8) = 2 rows
channel B takes int(4 * 0.8) = 3 rows
channel C takes int(6 * 0.8) = 4 rows
The expected output is:
id channel score
1 @2 A 60
2 @3 A 47
3 @4 B 33
4 @5 B 69
6 @7 B 90
8 @9 C 78
10 @11 C 89
11 @12 C 88
12 @13 C 99
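For reference, the per-channel counts above come straight from the group sizes; a quick check, not part of the solution itself:
sizes = df.groupby('channel').size()  # A: 3, B: 4, C: 6
take = (sizes * 0.8).astype(int)      # A: 2, B: 3, C: 4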
How can I do that? Thanks.
Upvotes: 9
Views: 5574
Reputation: 4513
There is another solution that does not require apply: sort by score, enumerate the rows of each group using cumcount, and divide by the group size to get each row's rank as a fraction of its group. From that you can build a mask to select the rows you want from the DataFrame:
key = 'channel'
a = 0.8
# Sort by score so the first rows of each group are the largest
df_sorted = df.sort_values('score', ascending=False)
# 0-based position of each row within its group
group_idx = df_sorted.groupby(key).cumcount()
# Size of the group each row belongs to
group_size = df_sorted.groupby(key)[key].transform('size')
# (group_idx + 1) / group_size is each row's rank as a fraction of its
# group; keeping fractions <= a selects int(len * a) rows per group
mask = (group_idx + 1) / group_size <= a
group_top_pct = df_sorted[mask]
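With the sample DataFrame above, this keeps the same nine rows as the apply-based answer below, just ordered by score:
print(group_top_pct)
id channel score
12 @13 C 99
6 @7 B 90
10 @11 C 89
11 @12 C 88
8 @9 C 78
4 @5 B 69
1 @2 A 60
2 @3 A 47
3 @4 B 33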
Upvotes: 0
Reputation: 862511
a = 0.8
df1 = (df.groupby('channel', group_keys=False)
         .apply(lambda x: x.nlargest(int(len(x) * a), 'score')))
print(df1)
id channel score
1 @2 A 60
2 @3 A 47
6 @7 B 90
4 @5 B 69
3 @4 B 33
12 @13 C 99
10 @11 C 89
11 @12 C 88
8 @9 C 78
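Because the original index is preserved here, a follow-up sort_index restores the row order from the question's expected output:
print(df1.sort_index())
id channel score
1 @2 A 60
2 @3 A 47
3 @4 B 33
4 @5 B 69
6 @7 B 90
8 @9 C 78
10 @11 C 89
11 @12 C 88
12 @13 C 99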
Another solution with sort_values + groupby + head:
df1 = (df.sort_values('score', ascending=False)
         .groupby('channel', group_keys=False)
         .apply(lambda x: x.head(int(len(x) * a)))
         .reset_index(drop=True))
print(df1)
id channel score
0 @2 A 60
1 @3 A 47
2 @7 B 90
3 @5 B 69
4 @4 B 33
5 @13 C 99
6 @11 C 89
7 @12 C 88
8 @9 C 78
Upvotes: 11