Reputation: 807
I am trying to groupby multiple items in multiple columns recursively. Wondering if anyone could help me out.
The following is an example.
import pandas as pd
import itertools
# i woiuld have more than two groups
data = {'group1': ['a'] * 3 + ['b'] * 3,
'group2': list(range(1,4)) + list(range(1,4)),
'num': [1, 2, 3, 10, 15, 20]}
df = pd.DataFrame(data)
print(df)
desired_df = {'group1': ['a'] * 9,
'group2': ['b'] * 9,
'num_group1': list(range(1,4))*3,
'num_group2': list(itertools.chain.from_iterable(itertools.repeat(x, 3) for x in list(range(1,4)))),
'desired_column': [11, 12, 13, 16, 17, 18, 21, 22, 23]
}
# desired column is the sum of 'num' from 'group1' and 'group2' in df
desired = pd.DataFrame(desired_df)
print(desired)
# i have tried this...which obviously doesnt work
data1 = df.merge(df.drop(columns=['num']), left_on=['group1'], right_on=['group1'])
data1.groupby(['group2_x', 'group2_y'])['num'].sum()
I am sure I am missing something simple...any suggestions?
Upvotes: 3
Views: 94
Reputation: 30940
You could try to use combinations from itertools:
from itertools import combinations
df2=pd.DataFrame([list(key)[0] +list(key)[1] for key in combinations(df.values.tolist(),2)])
df3=df2[df2[0].ne(df2[3])].reset_index(drop=True)
df3[5]=df3[5]+df3[2]
print(df3)
df4=df3[[0,1,3,4,5]].reindex(columns=[0,3,1,4,5]).rename(columns={3:'group2',0:'group1',1:'num_group1',4:'num_group2',5:'desired_column'})
df_desired=df4.sort_values('desired_column').reset_index(drop=True)
print(df_desired)
Output:
0 1 2 3 4 5
0 a 1 1 b 1 11
1 a 1 1 b 2 16
2 a 1 1 b 3 21
3 a 2 2 b 1 12
4 a 2 2 b 2 17
5 a 2 2 b 3 22
6 a 3 3 b 1 13
7 a 3 3 b 2 18
8 a 3 3 b 3 23
group1 group2 num_group1 num_group2 desired_column
0 a b 1 1 11
1 a b 2 1 12
2 a b 3 1 13
3 a b 1 2 16
4 a b 2 2 17
5 a b 3 2 18
6 a b 1 3 21
7 a b 2 3 22
8 a b 3 3 23
Upvotes: 1
Reputation: 36329
You can groupby
, then compute combinations of the relevant columns between all groups and finally use these items to index into the original data frame:
group_names = ['group1', 'group2']
groups = (x[1][group_names].values for x in df.groupby('group1'))
combined = [tuple(tuple(y) for y in x) for x in it.product(*groups)]
df.set_index(group_names, inplace=True)
result = pd.Series([sum(df.loc[x, 'num'] for x in item) for item in combined],
index=pd.MultiIndex.from_tuples(combined))
print(result)
This gives the following result (the index corresponds to the combinations):
(a, 1) (b, 1) 11
(b, 2) 16
(b, 3) 21
(a, 2) (b, 1) 12
(b, 2) 17
(b, 3) 22
(a, 3) (b, 1) 13
(b, 2) 18
(b, 3) 23
This also works for more than two groups, for example:
group1 group2 num
0 a 1 1
1 a 2 2
2 a 3 3
3 b 1 10
4 b 2 15
5 b 3 20
6 c 1 100
7 c 2 200
8 c 3 300
(a, 1) (b, 1) (c, 1) 111
(c, 2) 211
(c, 3) 311
(b, 2) (c, 1) 116
(c, 2) 216
(c, 3) 316
(b, 3) (c, 1) 121
(c, 2) 221
(c, 3) 321
(a, 2) (b, 1) (c, 1) 112
(c, 2) 212
(c, 3) 312
(b, 2) (c, 1) 117
(c, 2) 217
(c, 3) 317
(b, 3) (c, 1) 122
(c, 2) 222
(c, 3) 322
(a, 3) (b, 1) (c, 1) 113
(c, 2) 213
(c, 3) 313
(b, 2) (c, 1) 118
(c, 2) 218
(c, 3) 318
(b, 3) (c, 1) 123
(c, 2) 223
(c, 3) 323
It also works for more than two columns, for example:
group1 group2 group3 num
0 a 0 q 1
1 a 1 r 2
2 a 0 s 3
3 a 1 t 4
4 b 0 q 10
5 b 1 r 15
6 b 0 s 20
7 b 1 t 25
(a, 0, q) (b, 0, q) 11
(b, 1, r) 16
(b, 0, s) 21
(b, 1, t) 26
(a, 1, r) (b, 0, q) 12
(b, 1, r) 17
(b, 0, s) 22
(b, 1, t) 27
(a, 0, s) (b, 0, q) 13
(b, 1, r) 18
(b, 0, s) 23
(b, 1, t) 28
(a, 1, t) (b, 0, q) 14
(b, 1, r) 19
(b, 0, s) 24
(b, 1, t) 29
Upvotes: 1
Reputation: 323386
You can using
x, y = [y.assign(key=1) for x , y in df.groupby('group1')]
s=x.merge(y,on='key')
s['X']=s.num_x+s.num_y
s
group1_x group2_x num_x key group1_y group2_y num_y X
0 a 1 1 1 b 1 10 11
1 a 1 1 1 b 2 15 16
2 a 1 1 1 b 3 20 21
3 a 2 2 1 b 1 10 12
4 a 2 2 1 b 2 15 17
5 a 2 2 1 b 3 20 22
6 a 3 3 1 b 1 10 13
7 a 3 3 1 b 2 15 18
8 a 3 3 1 b 3 20 23
Upvotes: 1