Pratham

Reputation: 179

R's group_by->filter->summarise equivalent in pandas for quick prototyping?

I have to convert some code from R to Python.

In R, using dplyr, we do the following:

df %>%
  group_by(col_a, col_b) %>%
  summarise(a = sum(col_c == 'a'),
            b = sum(col_c == 'b'),
            c = b / a)

Some older answers suggest using the apply method and wrapping the requirements in a function. Creating functions is a rather slow process, especially when I have to try creating multiple new columns to experiment.

Can we do something like the R example above in pandas, in a similarly concise way?

I have actually tried something similar, but pandas is much slower (around 1 s, while dplyr takes 200 ms):

Just one example:

df.groupby('id').agg({'out':[lambda x:sum(x==4)]})

I was able to make it faster by filtering the dataset before grouping and aggregating:

df.assign(out=df.out==4).groupby('id').agg({'out':sum})

But this takes away the freedom of applying multiple filters and comparing them in a single line of code; i.e., I cannot build filters such as df.out == 4 and df.out == 3 in one line, store them in variables, and then take a ratio/sum of the two.
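For concreteness, the kind of one-chain flexibility I am after looks roughly like this (toy data; the column names fours/threes and the ratio are made up for illustration):

```python
import pandas as pd

# toy data; column names 'id' and 'out' follow the snippets above
df = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                   'out': [4, 3, 4, 4, 3]})

# named aggregation: each output column gets its own (column, function) pair,
# so several filtered sums fit in one chain without predefining functions
res = (df.groupby('id')
         .agg(fours=('out', lambda x: (x == 4).sum()),
              threes=('out', lambda x: (x == 3).sum()))
         .assign(ratio=lambda x: x.fours / x.threes))
```

This gives the dplyr-like shape, but the lambda-based aggregation is exactly the part that is slow on my data.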

I have googled a lot but have not found any answers.

Upvotes: 0

Views: 233

Answers (2)

Panwen Wang

Reputation: 3855

As simple as:

from datar.all import f, group_by, summarise, sum

df >> \
  group_by(f.col_a, f.col_b) >> \
  summarise(a = sum(f.col_c == 'a'),
            b = sum(f.col_c == 'b'),
            c = f.b/f.a
  )

I am the author of the datar package.

Upvotes: 2

Amit Bhartia

Reputation: 11

Try the method below. I used to face the same issue you are facing right now. It is a bit verbose, but it executes fast and in one flow, and it gives you the freedom to really get fancy if need be :). Hope it helps!

# basic imports
import numpy as np
import pandas as pd

df_summarized = (
    df.assign(  # create the columns you want to summarize before grouping
        out_four=np.where(df.out == 4, 1, 0),
        out_three=np.where(df.out == 3, 1, 0),
    )
    .groupby(['A', 'B'])
    .agg(
        total=('out', 'sum'),
        four=('out_four', 'sum'),
        three=('out_three', 'sum'),
    )
    .assign(  # create more custom columns (e.g. ratios) from the aggregation output
        four_by_three=lambda x: x.four / x.three,
        four_by_total=lambda x: x.four / x.total,
        three_by_total=lambda x: x.three / x.total,
        # you can also get really fancy and add columns like these
        three_normalized=lambda x: (x.three - x.three.mean()) / x.three.std(),
        four_perc_contribution=lambda x: x.four / x.four.sum(),
        # 'A' here refers to the index level left behind by the groupby
        total_over_A_total=lambda x: x.total / x.groupby('A').total.transform('sum'),
    )
)
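For instance, on a small made-up frame (column names A, B, out taken from the snippet above), the core of the chain looks like this:

```python
import numpy as np
import pandas as pd

# made-up example data
toy = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                    'B': [1, 1, 2, 2],
                    'out': [4, 3, 4, 3]})

# same pattern: flag before grouping, aggregate, then derive ratios
summary = (toy.assign(out_four=np.where(toy.out == 4, 1, 0),
                      out_three=np.where(toy.out == 3, 1, 0))
              .groupby(['A', 'B'])
              .agg(total=('out', 'sum'),
                   four=('out_four', 'sum'),
                   three=('out_three', 'sum'))
              .assign(four_by_three=lambda x: x.four / x.three))
```

Because the flag columns are plain integers, the aggregation uses pandas' fast built-in sum rather than a Python-level lambda.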


Upvotes: 1
