user3635284

Reputation: 513

Pandas: Grouping by values when a column is a list

I have a DataFrame like this one:

df = pd.DataFrame({'type':[[1,3],[1,2,3],[2,3]], 'value':[4,5,6]})

type | value
-------------
1,3  | 4
1,2,3| 5
2,3  | 6

I would like to group by the individual values inside the lists in the 'type' column, so that, for example, the sum of 'value' per type would be:

type | sum
------------
1    | 9
2    | 11
3    | 15

Thanks for your help!

Upvotes: 2

Views: 79

Answers (1)

jezrael

Reputation: 863226

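First reshape the DataFrame so that each element of the lists in column type gets its own row: build a new DataFrame with the DataFrame constructor, then call stack and reset_index. Because the lists have different lengths, the constructor pads the shorter ones with NaN and the stacked values become floats, so cast column type back to int. Finally, use groupby and aggregate with sum: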

df1 = pd.DataFrame(df['type'].values.tolist(), index = df['value']) \
        .stack() \
        .reset_index(name='type')
df1.type = df1.type.astype(int)
print (df1)
   value  level_1  type
0      4        0     1
1      4        1     3
2      5        0     1
3      5        1     2
4      5        2     3
5      6        0     2
6      6        1     3


print (df1.groupby('type', as_index=False)['value'].sum())
   type  value
0     1      9
1     2     11
2     3     15
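
As a cross-check on the reshape-then-group logic, the same totals can be computed with plain loops: each row's value is added once to every type in its list. This is only a minimal sketch; `sum_by_list_member` is a hypothetical helper name, not a pandas function.

```python
from collections import defaultdict

import pandas as pd


def sum_by_list_member(frame, list_col, value_col):
    """Add each row's value to the running total of every item in its list."""
    totals = defaultdict(int)
    for items, value in zip(frame[list_col], frame[value_col]):
        for item in items:
            # int() turns the NumPy integer into a plain Python int.
            totals[item] += int(value)
    # Sort the keys so the result is ordered like the groupby output.
    return dict(sorted(totals.items()))


df = pd.DataFrame({'type': [[1, 3], [1, 2, 3], [2, 3]], 'value': [4, 5, 6]})
totals = sum_by_list_member(df, 'type', 'value')
print(totals)
```

For larger frames the vectorized pandas version is likely to be faster; the loop is mainly useful for verifying the expected result.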

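Another solution uses join: build a Series of the types that keeps the original row index, then join it back to the value column: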

df1 = pd.DataFrame(df['type'].values.tolist()) \
        .stack() \
        .reset_index(level=1, drop=True) \
        .rename('type') \
        .astype(int)
print (df1)
0    1
0    3
1    1
1    2
1    3
2    2
2    3
Name: type, dtype: int32

df2 = df[['value']].join(df1)
print (df2)
   value  type
0      4     1
0      4     3
1      5     1
1      5     2
1      5     3
2      6     2
2      6     3

print (df2.groupby('type', as_index=False)['value'].sum())
   type  value
0     1      9
1     2     11
2     3     15
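
The join works because it aligns on index labels: the Series of types repeats each original row label once per list element, so each value row is matched that many times. A small sketch of that alignment, with the `types` Series written out by hand for illustration rather than built with stack:

```python
import pandas as pd

df = pd.DataFrame({'type': [[1, 3], [1, 2, 3], [2, 3]], 'value': [4, 5, 6]})

# Hand-built for illustration: one entry per list element,
# indexed by the label of the row the element came from.
types = pd.Series([1, 3, 1, 2, 3, 2, 3],
                  index=[0, 0, 1, 1, 1, 2, 2],
                  name='type')

# Row 0 of df matches two entries, row 1 three, row 2 two.
joined = df[['value']].join(types)
print(joined)
```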

A version with a Series: select the first level of the index with get_level_values, convert it to a Series with to_series, then group by the stacked type values and aggregate with sum. Finally, call reset_index and rename the column index to type:

df1 = pd.DataFrame(df['type'].values.tolist(), index = df['value']).stack().astype(int)
print (df1)
value   
4      0    1
       1    3
5      0    1
       1    2
       2    3
6      0    2
       1    3
dtype: int32

print (df1.index.get_level_values(0)
          .to_series()
          .groupby(df1.values)
          .sum()
          .reset_index()
          .rename(columns={'index':'type'}))
   type  value
0     1      9
1     2     11
2     3     15

Edit in response to a comment: a slightly modified version of the second solution, using DataFrame.pop, which also works with several value columns:

df = pd.DataFrame({'type':[[1,3],[1,2,3],[2,3]], 
                   'value1':[4,5,6], 
                   'value2':[1,2,3], 
                   'value3':[4,6,1]})
print (df)
        type  value1  value2  value3
0     [1, 3]       4       1       4
1  [1, 2, 3]       5       2       6
2     [2, 3]       6       3       1

df1 = pd.DataFrame(df.pop('type').values.tolist()) \
        .stack() \
        .reset_index(level=1, drop=True) \
        .rename('type') \
        .astype(int)
print (df1)
0    1
0    3
1    1
1    2
1    3
2    2
2    3
Name: type, dtype: int32

print (df.join(df1).groupby('type', as_index=False).sum())
   type  value1  value2  value3
0     1       9       3      10
1     2      11       5       7
2     3      15       6      11
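
On newer pandas versions (0.25 and later), DataFrame.explode can do the reshaping step directly. This is an alternative to the stack-based code above, not part of the original answer, and is shown only as a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'type': [[1, 3], [1, 2, 3], [2, 3]],
                   'value1': [4, 5, 6],
                   'value2': [1, 2, 3],
                   'value3': [4, 6, 1]})

# explode repeats each row once per list element; the exploded column
# has object dtype, so cast it to int before grouping.
exploded = df.explode('type').astype({'type': int})

result = exploded.groupby('type', as_index=False).sum()
print(result)
```

Unlike pop, explode leaves the original df unchanged.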

Upvotes: 2
