Reputation: 34517
Pandas in Python and Dplyr in R are both flexible data wrangling tools. For example, in R, with dplyr one can do the following;
custom_func <- function(col1, col2) length(col1) + length(col2)
ChickWeight %>%
group_by(Diet) %>%
summarise(m_weight = mean(weight),
var_time = var(Time),
covar = cov(weight, Time),
odd_stat = custom_func(weight, Time))
Notice how in one statement;
Is such a pattern also possible in pandas? Note that I am interested in doing this in a short statement (so not creating three different dataframes and then joining them).
Upvotes: 5
Views: 2672
Reputation: 3825
I am bringing datar
here for you to translate your R code into python:
>>> from pipda import register_func
>>> from datar.all import f, length, group_by, summarise, mean, var, cov
>>> from datar.datasets import ChickWeight
>>>
>>> @register_func(None)
... def custom_func(col1, col2):
... return length(col1) + length(col2)
...
>>> ChickWeight >> \
... group_by(f.Diet) >> \
... summarise(
... m_weight=mean(f.weight),
... var_time = var(f.Time),
... covar = cov(f.weight, f.Time),
... odd_stat = custom_func(f.weight, f.Time)
... )
Diet m_weight var_time covar odd_stat
<int64> <float64> <float64> <float64> <int64>
0 1 102.645455 45.422397 312.189871 440
1 2 122.616667 45.909722 398.564426 240
2 3 142.950000 45.909722 528.827731 240
3 4 135.262712 45.100618 441.868535 236
I am the author of the package. Feel free to submit issues if you have any questions.
Upvotes: 1
Reputation: 107567
With pandas groupby.apply() you can run multiple functions in a groupby aggregation. Please note for statistical functions you would need scipy
installed. For custom functions will need to run an aggregate like sum()
for groupwise data:
def customfct(x,y):
data = x / y
return data.mean()
def f(row):
row['m_weight'] = row['weight'].mean()
row['var_time'] = row['Time'].var()
row['cov'] = row['weight'].cov(row['Time'])
row['odd_stat'] = customfct(row['weight'], row['Time'])
return row
aggdf = df.groupby('Diet').apply(f)
Upvotes: 4
Reputation: 30414
Since version, 0.16.0, pandas has offered the assign()
method. I'm not sure exactly how it compares to pandas-ply as mentioned by @akrun, but it is part of pandas proper. I think that would handle all of your needs except maybe the groupby part (or possibly I just don't know how to combine with groupby).
Documentation for pandas assign
Inspired by dplyr’s mutate verb, DataFrame has a new assign() method. The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be called on the DataFrame. The new values are inserted, and the entire DataFrame (with all original and new columns) is returned.
Looking at your example code, I'm not sure how to do the groupby, but you certainly can do this sort of thing. (Actually, I may post a followup SO question if I can't figure it out.)
df.assign( mean_weight = lambda x: x['weight'].mean(),
var_time = lambda x: x['time'].var(),
covar = lambda x: np.cov(x.t1,x.t2)[0,0] )
Or, can do it like this:
df.assign( mean_weight = df['weight'].mean(),
var_time = df['time'].var(),
covar = np.cov(df['t1'],df['t2'])[0,0] )
Alternatively, you could include the groupby like this (mixing the alternate ways below):
df.assign( mean_weight = df['weight'].groupby(df.diet).transform('mean'),
var_time = lambda x: x['time'].groupby(x['diet']).transform('mean') )
But you have to do a groupby 4x, not 1x, so that's not real satisfying as a solution...
I'll play around with that syntax a little more and see if I can get it closer to your example (you may want to provide sample data for a better answer).
Alternatively you can do standard groupby techniques to get your answer, but I think it takes multiple statements (can't just do one long chained line) because of the complexity -- some of your assignments can be combined with groupby's agg
but I don't think the user function can.
Upvotes: 2