Van Peer
Van Peer

Reputation: 2167

pandas group by multiple columns and select top n rows

I have a Pandas DataFrame in the following structure

file,page_num,val,cls
f1,1,v1,c1
f1,1,v2,c2
f1,1,v3,c2
f1,1,v4,c2     # not required
f1,2,v5,c1
f1,2,v6,c2
f2,1,v7,c1
f2,1,v8,c2
f2,1,v9,c2
f2,1,v10,c2    # not required
f2,2,v11,c1
f2,2,v12,c2

I need to find top two rows of cls c2 for every page_num in file. Other (cls) classes should remain as is.

Desired output

file,page_num,val,cls
f1,1,v1,c1
f1,1,v2,c2
f1,1,v3,c2
f1,2,v5,c1
f1,2,v6,c2
f2,1,v7,c1
f2,1,v8,c2
f2,1,v9,c2
f2,2,v11,c1
f2,2,v12,c2

I am trying to select all classes except c2 and then use the following to find those rows which are required for c2.

df = pd.read_csv('sample_f.csv')

df1 = df[df.cls == 'c2'].groupby(['file', 'page_num'])
df2 = df1.apply(lambda x: x.sort_values(['cls']))
df3 = df2.reset_index(drop=True)

df = df.loc[(df.cls !=c2) & (??)]

Upvotes: 2

Views: 2283

Answers (2)

somiandras
somiandras

Reputation: 227

You can use head with groupby, and then concatenate with the rest of the dataframe, so assuming that you need to find the top two val in each group:

top_2 = (
    df[df['cls'] == 'c2']
    .sort_values('val', ascending=False)
    .groupby(['file', 'page_num'])
    .head(2)
    .reset_index()
)
rest = df[df['cls'] != 'c2']
final_df = pd.concat([top_2, rest])

(Edit: I added sorting by val, as from the question it is not clear whether the dataframe is already sorted or not.)

Upvotes: 3

cosmic_inquiry
cosmic_inquiry

Reputation: 2684

This is similar to answer that just posted. But yeah, you can use head:

import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO("""file,page_num,val,cls
f1,1,v1,c1
f1,1,v2,c2
f1,1,v3,c2
f1,1,v4,c2     
f1,2,v5,c1
f1,2,v6,c2
f2,1,v7,c1
f2,1,v8,c2
f2,1,v9,c2
f2,1,v10,c2    
f2,2,v11,c1
f2,2,v12,c2"""))
df.cls = df.cls.str.strip()
keep = df[df.cls == 'c2'].groupby(['file', 'page_num']).head(2).index
df.loc[df.loc[df.cls != 'c2'].index | keep]

Output:

   file  page_num  val cls
0    f1         1   v1  c1
1    f1         1   v2  c2
2    f1         1   v3  c2
4    f1         2   v5  c1
5    f1         2   v6  c2
6    f2         1   v7  c1
7    f2         1   v8  c2
8    f2         1   v9  c2
10   f2         2  v11  c1
11   f2         2  v12  c2

Upvotes: 2

Related Questions