Most efficient way to groupby a dataframe and then filter in pandas

Question

Afternoon All,

I have a very large dataset which I have groupedby. Here is a sample :

df_ccy = df.groupby(['currency_str','state' 
 ['state'].count().reset_index(name='count').sort_values(['count'], ascending=False)

display(df_ccy)

Output:

currency_str    state           count
USD             Traded Away     148
AUD             Dealer Timeout  52
CAD             Done            44
USD             Covered         38
USD             Dealer Timeout  29
ZAR             Done            22

I would like to only show:

CAD             Done            44
ZAR             Done            22

I achieved this via:

display(df_ccy [df_ccy ['state']=='Done'][['currency_str','state','count']])

Should I use a Lambda function on the original groupby statement or filter as I have done above? What is best practice?

jezrael · Accepted Answer

df = pd.DataFrame({'currency_str': ['USD', 'AUD', 'CAD', 'CAD', 'ZAR', 
                                    'ZAR', 'USD', 'USD', 'ZAR'], 
                   'state': ['Traded Away', 'Dealer Timeout', 'Done', 'Done', 'Done',
                             'Done', 'Covered', 'Dealer Timeout', 'Done']})

print (df)
  currency_str           state
0          USD     Traded Away
1          AUD  Dealer Timeout
2          CAD            Done
3          CAD            Done
4          ZAR            Done
5          ZAR            Done
6          USD         Covered
7          USD  Dealer Timeout
8          ZAR            Done

I think you need filter first:

df1 = df[df['state']=='Done']
#alternative
#df1 = df.query("state == 'Done'")

And then count:

df_ccy = (df1.groupby(['currency_str','state'])['state']
            .count()
            .reset_index(name='count')
            .sort_values(['count'], ascending=False))

print (df_ccy)
  currency_str state  count
1          ZAR  Done      3
0          CAD  Done      2

Or if is not important column state with same filtered value:

df_ccy = (df1['currency_str'].value_counts()
                            .reset_index(name='count')
                            .rename(columns={'index':'currency_str'}))
print (df_ccy)
  currency_str  count
0          ZAR      3
1          CAD      2

Most efficient way to groupby a dataframe and then filter in pandas

Answers (1)

Related Questions