Reputation: 403
In R, I can summarize data using more than one column as follows:
library(dplyr)
A = B %>%
  group_by(col1, col2) %>%
  summarize(newcol = sum(col3) / sum(col4))
How do I perform the same operation in one step on a pandas DataFrame in Python?
I can do this in two steps. Step 1:
A = B.groupby(['col1','col2']).agg({'col3': 'sum','col4':'sum'})
Step 2:
A['newcol'] = A['col3']/A['col4']
Upvotes: 4
Views: 3687
Reputation: 3825
With datar, you can do it the same way you did in R:
from datar import f
from datar.base import sum
from datar.dplyr import group_by, summarise
A = (
    B
    >> group_by(f.col1, f.col2)
    # use the imported name `summarise`; `summarize` is not imported above
    >> summarise(newcol=sum(f.col3) / sum(f.col4))
)
Upvotes: 1
You need to use assign with a lambda expression:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': list('aaabbb'),
                   'col2': list('xyxyxy'),
                   'col3': np.random.randn(6),
                   'col4': np.random.randn(6)})
df
Out:
col1 col2 col3 col4
0 a x -2.276155 0.323778
1 a y -0.367525 -2.570142
2 a x -0.672530 2.265560
3 b y 0.588741 0.193499
4 b x -1.368829 0.717997
5 b y 1.012271 1.354408
(df.groupby(['col1','col2'])
.agg({'col3': 'sum','col4':'sum'})
.assign(newcol=lambda x: x['col3']/x['col4']))
Out:
col4 col3 newcol
col1 col2
a x 2.589338 -2.948686 -1.138780
y -2.570142 -0.367525 0.142998
b x 0.717997 -1.368829 -1.906453
y 1.547907 1.601012 1.034308
If all you need is the new column, use apply:
df.groupby(['col1','col2']).apply(lambda x: x['col3'].sum() / x['col4'].sum())
Out:
col1 col2
a x -1.138780
y 0.142998
b x -1.906453
y 1.034308
dtype: float64
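The Series above is indexed by (col1, col2) and has no name. If a flat table shaped like dplyr's summarize output is wanted, one possible way is to name the Series and reset its index. This is only a sketch: the numbers are made up for reproducibility, and selecting col3 and col4 before apply is my own choice so that the lambda only sees the value columns.

```python
import pandas as pd

# Made-up, deterministic data (not the random data used in the answer)
df = pd.DataFrame({
    "col1": list("aaabbb"),
    "col2": list("xyxyxy"),
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "col4": [2.0, 4.0, 2.0, 8.0, 5.0, 4.0],
})

# Ratio of sums per group, computed by apply on the two value columns
ratio = (
    df.groupby(["col1", "col2"])[["col3", "col4"]]
      .apply(lambda g: g["col3"].sum() / g["col4"].sum())
)

# Name the Series and turn the group keys back into ordinary columns
flat = ratio.rename("newcol").reset_index()
print(flat)
```

The result has the columns col1, col2 and newcol, one row per group.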
On a big data set, avoid apply, which calls the Python lambda once per group. Aggregate first and use eval on the result instead:
(df.groupby(['col1','col2'])
.agg({'col3': 'sum','col4':'sum'})
.eval('col3 / col4'))
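The eval call above returns an unnamed Series. DataFrame.eval also accepts an assignment expression, in which case it returns a copy of the frame with the new column added. A minimal sketch, again with made-up numbers rather than the random data from the answer:

```python
import pandas as pd

# Made-up, deterministic data (not the random data used in the answer)
df = pd.DataFrame({
    "col1": list("aaabbb"),
    "col2": list("xyxyxy"),
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "col4": [2.0, 4.0, 2.0, 8.0, 5.0, 4.0],
})

# Aggregate once, then let eval add the ratio as a named column
out = (
    df.groupby(["col1", "col2"])
      .agg({"col3": "sum", "col4": "sum"})
      .eval("newcol = col3 / col4")
)
print(out)
```

This keeps the summed col3 and col4 next to newcol, much like the assign version, while the division runs as one vectorized operation on the aggregated frame.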
Upvotes: 7