Fam

Reputation: 533

Python analogue of the dplyr summarise function

I have a pandas DataFrame df that I would like to group by the variable 'house', applying a specific operation to each of three other variables: 'var1', 'var2' and 'var3'. Suppose the three variables are numeric.

import pandas as pd

data = {'house': ['A', 'B', 'A', 'A', 'B', 'B', 'B'], 'var1': [3, 0, 1, 3, 4, 5, 3], 'var2': [2, 0, 5, 1, 4, 1, 3], 'var3': [4, 2, 3, 3, 0, 5, 1]}
df = pd.DataFrame(data)
df

Now I would like to create three new variables per house:

  1. new_var1 = the number of times var1 equals 3
  2. new_var2 = sum of var2 (simple aggregate)
  3. new_var3 = sum of var3 (simple aggregate)

In R, I would do this directly with dplyr:

require(dplyr)
data = data.frame('house'=c('A', 'B', 'A', 'A', 'B', 'B', 'B'), 
        'var1'=c(3, 0, 1, 3,4,5,3), 
        'var2'=c(2, 0, 5, 1,4,1,3),
        'var3'=c(4, 2, 3, 3,0,5,1))

df = data %>% group_by(house) %>% summarise(new_var1 = sum(var1 == 3),
                                        new_var2 = sum(var2),
                                        new_var3 = sum(var3))
df

In Python, I start by grouping:

df.groupby(['house'])[['var1', 'var2', 'var3']]

But I would like to continue in the same line of code, and I don't know how to do this. Is there an analogue of the 'summarise' function in Python?

Upvotes: 2

Views: 635

Answers (2)

Panwen Wang

Reputation: 3825

I have been porting R data packages (dplyr, tidyr, tibble, etc.) to Python:

https://github.com/pwwang/datar

If you are familiar with those packages in R and want to use the same API in Python, datar is meant for you:

from datar import f
from datar.all import *

data = tibble(
  house=c('A', 'B', 'A', 'A', 'B', 'B', 'B'), 
  var1=c(3, 0, 1, 3,4,5,3), 
  var2=c(2, 0, 5, 1,4,1,3),
  var3=c(4, 2, 3, 3,0,5,1)
)

df= data >> group_by(f.house) >> summarise(new_var1 = sum(f.var1 == 3),
                                           new_var2 = sum(f.var2),
                                           new_var3 = sum(f.var3))
print(df)

Output:

  house  new_var1  new_var2  new_var3
0     A         2         8        10
1     B         1         8         8

Upvotes: 0

fmarm

Reputation: 4284

You can do this with the agg method, passing one aggregation per column and then renaming the resulting columns:

(df.groupby(['house']).agg({'var1': lambda x: (x==3).sum(), 
                            'var2': 'sum',
                            'var3': 'sum'})
   .rename(columns={"var1": "new_var1", 
                    "var2": "new_var2",
                    "var3":"new_var3"})
)
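
As a possible variation on the above, pandas also supports named aggregation (available since pandas 0.25), which sets the output column names inside agg itself and so reads a little closer to summarise. The sketch below reuses the question's data; the name `result` and the trailing reset_index() call are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Same sample data as in the question.
data = {'house': ['A', 'B', 'A', 'A', 'B', 'B', 'B'],
        'var1': [3, 0, 1, 3, 4, 5, 3],
        'var2': [2, 0, 5, 1, 4, 1, 3],
        'var3': [4, 2, 3, 3, 0, 5, 1]}
df = pd.DataFrame(data)

# Named aggregation: output_name=(input_column, aggregation).
result = (
    df.groupby('house')
      .agg(new_var1=('var1', lambda s: (s == 3).sum()),  # count rows where var1 == 3
           new_var2=('var2', 'sum'),
           new_var3=('var3', 'sum'))
      .reset_index()  # turn the 'house' index back into a regular column
)
print(result)
```

Each keyword argument becomes an output column, so no separate rename step is needed, and the same input column could be aggregated more than once under different names.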

Upvotes: 4
