ktj1989
ktj1989

Reputation: 807

Groupby multiple items in same column

I am trying to groupby multiple items in multiple columns recursively. Wondering if anyone could help me out.

The following is an example.

import pandas as pd
import itertools

# i woiuld have more than two groups
data = {'group1': ['a'] * 3 + ['b'] * 3,
        'group2': list(range(1,4)) + list(range(1,4)),
        'num': [1, 2, 3, 10, 15, 20]} 

df = pd.DataFrame(data)
print(df)

desired_df = {'group1': ['a'] * 9,
              'group2': ['b'] * 9,
              'num_group1': list(range(1,4))*3,
              'num_group2': list(itertools.chain.from_iterable(itertools.repeat(x, 3) for x in list(range(1,4)))),
              'desired_column': [11, 12, 13, 16, 17, 18, 21, 22, 23]
              }
# desired column is the sum of 'num' from 'group1' and 'group2' in df
desired = pd.DataFrame(desired_df)
print(desired)   



# i have tried this...which obviously doesnt work
data1 = df.merge(df.drop(columns=['num']), left_on=['group1'], right_on=['group1'])
data1.groupby(['group2_x', 'group2_y'])['num'].sum()

I am sure I am missing something simple...any suggestions?

Upvotes: 3

Views: 94

Answers (3)

ansev
ansev

Reputation: 30940

You could try to use combinations from itertools:

from itertools import combinations
df2=pd.DataFrame([list(key)[0] +list(key)[1] for key in combinations(df.values.tolist(),2)])
df3=df2[df2[0].ne(df2[3])].reset_index(drop=True)
df3[5]=df3[5]+df3[2]
print(df3) 
df4=df3[[0,1,3,4,5]].reindex(columns=[0,3,1,4,5]).rename(columns={3:'group2',0:'group1',1:'num_group1',4:'num_group2',5:'desired_column'})
df_desired=df4.sort_values('desired_column').reset_index(drop=True)
print(df_desired)

Output:

   0  1  2  3  4   5
0  a  1  1  b  1  11
1  a  1  1  b  2  16
2  a  1  1  b  3  21
3  a  2  2  b  1  12
4  a  2  2  b  2  17
5  a  2  2  b  3  22
6  a  3  3  b  1  13
7  a  3  3  b  2  18
8  a  3  3  b  3  23
  group1 group2  num_group1  num_group2  desired_column
0      a      b           1           1              11
1      a      b           2           1              12
2      a      b           3           1              13
3      a      b           1           2              16
4      a      b           2           2              17
5      a      b           3           2              18
6      a      b           1           3              21
7      a      b           2           3              22
8      a      b           3           3              23

Upvotes: 1

a_guest
a_guest

Reputation: 36329

You can groupby, then compute combinations of the relevant columns between all groups and finally use these items to index into the original data frame:

group_names = ['group1', 'group2']
groups = (x[1][group_names].values for x in df.groupby('group1'))
combined = [tuple(tuple(y) for y in x) for x in it.product(*groups)]
df.set_index(group_names, inplace=True)
result = pd.Series([sum(df.loc[x, 'num'] for x in item) for item in combined],
                   index=pd.MultiIndex.from_tuples(combined))
print(result)

This gives the following result (the index corresponds to the combinations):

(a, 1)  (b, 1)    11
        (b, 2)    16
        (b, 3)    21
(a, 2)  (b, 1)    12
        (b, 2)    17
        (b, 3)    22
(a, 3)  (b, 1)    13
        (b, 2)    18
        (b, 3)    23

This also works for more than two groups, for example:

  group1  group2  num
0      a       1    1
1      a       2    2
2      a       3    3
3      b       1   10
4      b       2   15
5      b       3   20
6      c       1  100
7      c       2  200
8      c       3  300

(a, 1)  (b, 1)  (c, 1)    111
                (c, 2)    211
                (c, 3)    311
        (b, 2)  (c, 1)    116
                (c, 2)    216
                (c, 3)    316
        (b, 3)  (c, 1)    121
                (c, 2)    221
                (c, 3)    321
(a, 2)  (b, 1)  (c, 1)    112
                (c, 2)    212
                (c, 3)    312
        (b, 2)  (c, 1)    117
                (c, 2)    217
                (c, 3)    317
        (b, 3)  (c, 1)    122
                (c, 2)    222
                (c, 3)    322
(a, 3)  (b, 1)  (c, 1)    113
                (c, 2)    213
                (c, 3)    313
        (b, 2)  (c, 1)    118
                (c, 2)    218
                (c, 3)    318
        (b, 3)  (c, 1)    123
                (c, 2)    223
                (c, 3)    323

It also works for more than two columns, for example:

  group1  group2 group3  num
0      a       0      q    1
1      a       1      r    2
2      a       0      s    3
3      a       1      t    4
4      b       0      q   10
5      b       1      r   15
6      b       0      s   20
7      b       1      t   25

(a, 0, q)  (b, 0, q)    11
           (b, 1, r)    16
           (b, 0, s)    21
           (b, 1, t)    26
(a, 1, r)  (b, 0, q)    12
           (b, 1, r)    17
           (b, 0, s)    22
           (b, 1, t)    27
(a, 0, s)  (b, 0, q)    13
           (b, 1, r)    18
           (b, 0, s)    23
           (b, 1, t)    28
(a, 1, t)  (b, 0, q)    14
           (b, 1, r)    19
           (b, 0, s)    24
           (b, 1, t)    29

Upvotes: 1

BENY
BENY

Reputation: 323386

You can using

x, y = [y.assign(key=1) for x , y in df.groupby('group1')]
s=x.merge(y,on='key')
s['X']=s.num_x+s.num_y
s
  group1_x  group2_x  num_x  key group1_y  group2_y  num_y   X
0        a         1      1    1        b         1     10  11
1        a         1      1    1        b         2     15  16
2        a         1      1    1        b         3     20  21
3        a         2      2    1        b         1     10  12
4        a         2      2    1        b         2     15  17
5        a         2      2    1        b         3     20  22
6        a         3      3    1        b         1     10  13
7        a         3      3    1        b         2     15  18
8        a         3      3    1        b         3     20  23

Upvotes: 1

Related Questions