Reputation: 387
I am trying to aggregate values in a groupby over multiple columns. I come from the R/dplyr world, where what I want is usually achievable in a single line using group_by/summarize, and I am looking for an equivalently elegant way of doing it in pandas.
Consider the input dataset below. I would like to group by state and compute v1 = sum(n1)/sum(d1) and v2 = sum(n2)/sum(d2) within each state.
The r-code for this using dplyr is as follows:
input %>% group_by(state) %>%
  summarise(v1 = sum(n1) / sum(d1),
            v2 = sum(n2) / sum(d2))
Is there an elegant way of doing this in Python? I found a slightly verbose way of getting what I want in a Stack Overflow answer here. Here is the modified Python code from that link (mn is the input DataFrame):
In [14]: s = mn.groupby('state', as_index=False).sum()
In [15]: s['v1'] = s['n1'] / s['d1']
In [16]: s['v2'] = s['n2'] / s['d2']
In [17]: s[['state', 'v1', 'v2']]
INPUT DATASET
state n1 n2 d1 d2
CA 100 1000 1 2
FL 200 2000 2 4
CA 300 3000 3 6
AL 400 4000 4 8
FL 500 5000 5 2
NY 600 6000 6 4
CA 700 7000 7 6
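For example, for CA: v1 = (100 + 300 + 700) / (1 + 3 + 7) = 1100 / 11 = 100, and v2 = (1000 + 3000 + 7000) / (2 + 6 + 6) = 11000 / 14 ≈ 785.71.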
OUTPUT
state    v1       v2
AL       100.0    500.000000
CA       100.0    785.714286
FL       100.0    1166.666667
NY       100.0    1500.000000
Upvotes: 4
Views: 741
Reputation: 28729
Another option is with the pipe function, where the groupby object is reusable:
(df.groupby('state')
   .pipe(lambda g: pd.DataFrame({'v1': g.n1.sum() / g.d1.sum(),
                                 'v2': g.n2.sum() / g.d2.sum()}))
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
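Since a groupby object is an ordinary Python object, it can also be bound to a name and reused across several computations. A minimal sketch of that idea, assuming df is the input frame from the question:

import pandas as pd

# the same groupby object feeds every derived column
g = df.groupby('state')
out = pd.DataFrame({'v1': g['n1'].sum() / g['d1'].sum(),
                    'v2': g['n2'].sum() / g['d2'].sum()})
print(out)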
Another option would be to convert the columns into a MultiIndex before grouping:
temp = df.set_index('state')
# split 'n1' -> ('n', '1', ''); droplevel(-1) discards the empty trailing piece
temp.columns = temp.columns.str.split(r'(\d)', expand=True).droplevel(-1)
(temp.groupby('state')
     .sum()
     .pipe(lambda df: df.n / df.d)
     .add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
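The final division works because df.n and df.d each end up with the second-level labels '1' and '2', so pandas pairs them label by label. A minimal runnable sketch of that alignment, using the summed CA values from the input as illustrative numbers:

import pandas as pd

# two-level columns, like those produced by the split above
s = pd.DataFrame({('n', '1'): [1100], ('n', '2'): [11000],
                  ('d', '1'): [11], ('d', '2'): [14]})
# s['n'] and s['d'] both have columns ['1', '2'], so the
# division pairs them by label, yielding v1 and v2
print(s['n'] / s['d'])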
Yet another way, still with the MultiIndex option, while avoiding a groupby:
# keep the index, necessary for unstacking later
temp = df.set_index('state', append=True)
# convert the columns to a MultiIndex
temp.columns = temp.columns.map(tuple)
# this works because the index is unique
(temp.unstack('state')
     .sum()
     .unstack([0, 1])
     .pipe(lambda df: df.n / df.d)
     .add_prefix('v')
)
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
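Note: unstack raises on duplicate index entries, which is why the original row index is kept with append=True; the resulting (row, state) pairs are guaranteed unique.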
Upvotes: 1
Reputation: 3855
Here is the equivalent of what you did in R, using the datar package:
>>> from datar.all import f, tribble, group_by, summarise, sum
>>>
>>> input = tribble(
... f.state, f.n1, f.n2, f.d1, f.d2,
... "CA", 100, 1000, 1, 2,
... "FL", 200, 2000, 2, 4,
... "CA", 300, 3000, 3, 6,
... "AL", 400, 4000, 4, 8,
... "FL", 500, 5000, 5, 2,
... "NY", 600, 6000, 6, 4,
... "CA", 700, 7000, 7, 6,
... )
>>>
>>> input >> group_by(f.state) >> \
... summarise(v1=sum(f.n1)/sum(f.d1),
... v2=sum(f.n2)/sum(f.d2))
state v1 v2
<object> <float64> <float64>
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
I am the author of the datar package.
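(datar is a third-party library; assuming the PyPI name matches, it can be installed with pip install datar.)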
Upvotes: 1
Reputation: 150815
Another solution:
def func(x):
    # x is the sub-DataFrame for one state; sum each column first
    u = x.sum()
    return pd.Series({'v1': u['n1'] / u['d1'],
                      'v2': u['n2'] / u['d2']})

df.groupby('state').apply(func)
Output:
v1 v2
state
AL 100.0 500.000000
CA 100.0 785.714286
FL 100.0 1166.666667
NY 100.0 1500.000000
Upvotes: 1
Reputation: 863701
One possible solution with DataFrame.assign
and DataFrame.reindex
:
df = (mn.groupby('state', as_index=False)
        .sum()
        .assign(v1=lambda x: x['n1'] / x['d1'], v2=lambda x: x['n2'] / x['d2'])
        .reindex(['state', 'v1', 'v2'], axis=1))
print(df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
And another with GroupBy.apply and a custom lambda function:
df = (mn.groupby('state')
        .apply(lambda x: x[['n1', 'n2']].sum() / x[['d1', 'd2']].sum().values)
        .reset_index()
        .rename(columns={'n1': 'v1', 'n2': 'v2'})
)
print(df)
state v1 v2
0 AL 100.0 500.000000
1 CA 100.0 785.714286
2 FL 100.0 1166.666667
3 NY 100.0 1500.000000
Upvotes: 1