VMEscoli

Reputation: 1232

How to sum or count groups of multiple columns in pandas

I'm trying to group several groups of columns in a pandas dataframe and count or sum their rows.

I've checked many questions already and the most similar one I found is this > Groupby sum and count on multiple columns in python, but, from what I understand, it takes many steps to reach my goal. I was also looking at this link

As an example, I have the dataframe below:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,5,size=(5, 7)), columns=["grey2","red1","blue1","red2","red3","blue2","grey1"])

     grey2   red1 blue1 red2 red3 blue2 grey1
0       4      3    0      2    4   0   2
1       4      2    0      4    0   3   1
2       1      1    3      1    1   3   1
3       4      4    1      4    1   1   1
4       3      4    1      0    3   3   1

I want to group here, all the columns by colour, for example, and what I would expect is:

If I sum the numbers,

blue  15
grey  22
red   34

If I count ( x > 0 ) then I will get,

  blue  7
  grey  10
  red   13

This is what I have achieved so far. Now I would have to sum and then build a dataframe with the results, but if I have 100 groups this would be very time consuming.

pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='sum', margins=True)
   red1  red2   red3
0    3     2    4
1    2     4    0
2    1     1    1
3    4     4    1
4    4     0    3
ALL  14   11    9

pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='count', margins=True)

But here it is also counting the zeros:

     red1 red2  red3
   0    1   1   1
   1    1   1   1
   2    1   1   1
   3    1   1   1
   4    1   1   1
  All   5   5   5
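For what it's worth, the `pivot_table` call itself can be made to skip zeros by passing a custom `aggfunc` instead of `'count'`; a minimal sketch on the same toy frame (the fixed seed is only there to make the example reproducible):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 7)),
                  columns=["grey2", "red1", "blue1", "red2", "red3", "blue2", "grey1"])

# a custom aggfunc counts only the values > 0 instead of all rows;
# the 'All' margin then gives the nonzero count per column
table = pd.pivot_table(data=df, index=df.index,
                       values=["red1", "red2", "red3"],
                       aggfunc=lambda s: s.gt(0).sum(),
                       margins=True)
print(table)
```

This still leaves one table per colour group, so it does not remove the "100 groups" problem, only the zero-counting one.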

Not sure how to alter the function to get my results, and I've already spent hours on it; hopefully you can help.

NOTE: I only use colours in this example to simplify the case, but I could have many columns, named col001 to col300, etc... So the groups could be:

blue = col131, col254, col005
red =  col023, col190, col053

and so on.....

Upvotes: 4

Views: 544

Answers (4)

screenpaver

Reputation: 1130

You could also do something like this for the general case:

colors = {'blue':['blue1','blue2'], 'red':['red1','red2','red3'], 'grey':['grey1','grey2']}
orig_columns = df.columns
df.columns = [key for col in df.columns for key in colors.keys() if col in colors[key]]
print(df.groupby(level=0,axis=1).sum().sum())
df.columns = orig_columns
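A sketch of the count case under the same renaming idea (the transpose here sidesteps the `axis=1` groupby, which newer pandas versions deprecate; the seeded frame is just for reproducibility):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 7)),
                  columns=["grey2", "red1", "blue1", "red2", "red3", "blue2", "grey1"])

colors = {'blue': ['blue1', 'blue2'],
          'red': ['red1', 'red2', 'red3'],
          'grey': ['grey1', 'grey2']}

# rename each column to its colour group
df.columns = [key for col in df.columns for key in colors if col in colors[key]]

# count strictly positive cells: transpose, group the duplicate labels, reduce twice
counts = df.gt(0).T.groupby(level=0).sum().sum(axis=1)
print(counts)
```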

Upvotes: 0

ansev

Reputation: 30920

Use:

df.groupby(df.columns.str.replace(r'\d+', '', regex=True), axis=1).sum().sum()

Output:

blue    15
grey    22
red     34
dtype: int64

This works regardless of the number of digits in the column names:

df = df.add_suffix('2222')
print(df)

   grey22222  red12222  blue12222  red22222  red32222  blue22222  grey12222
0          4         3          0         2         4          0          2
1          4         2          0         4         0          3          1
2          1         1          3         1         1          3          1
3          4         4          1         4         1          1          1
4          3         4          1         0         3          3          1

df.groupby(df.columns.str.replace(r'\d+', '', regex=True), axis=1).sum().sum()
blue    15
grey    22
red     34
dtype: int64
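The count-of-positives version presumably follows the same pattern; a sketch that strips the digits once and reuses the labels for both reductions (the transpose avoids the `axis=1` groupby that newer pandas deprecates, and the seed is only for reproducibility):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 7)),
                  columns=["grey2", "red1", "blue1", "red2", "red3", "blue2", "grey1"])

# strip the trailing digits to get one label per colour group
groups = df.columns.str.replace(r'\d+', '', regex=True)

# transpose so the group labels index the rows, then reduce twice
sums = df.T.groupby(groups).sum().sum(axis=1)
counts = df.gt(0).T.groupby(groups).sum().sum(axis=1)
print(sums)
print(counts)
```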

Upvotes: 0

Quang Hoang

Reputation: 150765

You can use pd.wide_to_long:

data= pd.wide_to_long(df.reset_index(), stubnames=['grey','red','blue'], 
                i='index',
                j='group',
                sep=''
               )

Output:

# data
             grey  red  blue
index group                 
0     1       2.0    3   0.0
      2       4.0    2   0.0
      3       NaN    4   NaN
1     1       1.0    2   0.0
      2       4.0    4   3.0
      3       NaN    0   NaN
2     1       1.0    1   3.0
      2       1.0    1   3.0
      3       NaN    1   NaN
3     1       1.0    4   1.0
      2       4.0    4   1.0
      3       NaN    1   NaN
4     1       1.0    4   1.0
      2       3.0    0   3.0
      3       NaN    3   NaN

And:

data.sum()
# grey    22.0
# red     34.0
# blue    15.0
# dtype: float64

data.gt(0).sum()
# grey    10
# red     13
# blue     7
# dtype: int64

Update: wide_to_long is just a convenient shortcut for melt and rename. So if you have a dictionary {cat: [col_list]}, you can fall back to that:

groups = {'blue' : ['col131', 'col254', 'col005'],
          'red' : ['col023', 'col190', 'col053']}

# create the inverse dictionary for mapping
inv_group = {col: k for k, v in groups.items() for col in v}

data = df.melt()

# map the original columns to group
data['group'] = data['variable'].map(inv_group)

# from now on, it's similar to other answers
# sum
data.groupby('group')['value'].sum()

# count
data['value'].gt(0).groupby(data['group']).sum()
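The update above can be run end-to-end on the question's toy frame by swapping the col### names for the colour columns; a self-contained sketch (the seed is only there so the example reproduces):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 7)),
                  columns=["grey2", "red1", "blue1", "red2", "red3", "blue2", "grey1"])

groups = {'blue': ['blue1', 'blue2'],
          'red': ['red1', 'red2', 'red3'],
          'grey': ['grey1', 'grey2']}

# invert {group: [cols]} into {col: group} so Series.map can use it
inv_group = {col: k for k, v in groups.items() for col in v}

data = df.melt()
data['group'] = data['variable'].map(inv_group)

sums = data.groupby('group')['value'].sum()
counts = data['value'].gt(0).groupby(data['group']).sum()
print(sums)
print(counts)
```

Columns missing from the dictionary would map to NaN and simply drop out of the groupby.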

Upvotes: 3

ALollz

Reputation: 59549

The complication here is that you want to collapse both rows and columns, which is generally difficult to do at the same time. We can melt to go from your wide format to a long format, which reduces the problem to a single groupby.

# Get rid of the numbers + reshape
df.columns = pd.Index(df.columns.str.rstrip('0123456789'), name='color')
df = df.melt()

df.groupby('color').sum()
#       value
#color       
#blue      15
#grey      22
#red       34

df.value.gt(0).groupby(df.color).sum()
#color
#blue     7.0
#grey    10.0
#red     13.0
#Name: value, dtype: float64

With names that are less simple to group, we'd need to have the mapping somewhere, but the steps are very similar:

# Unnecessary in this case, but more general
d = {'grey1': 'color_1', 'grey2': 'color_1', 
     'red1': 'color_2', 'red2': 'color_2', 'red3': 'color_2',
     'blue1': 'color_3', 'blue2': 'color_3'}

df.columns = pd.Index(df.columns.map(d), name='color')
df = df.melt()
df.groupby('color').sum()

#         value
#color         
#color_1     22
#color_2     34
#color_3     15
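For completeness, the count under the same mapping presumably looks like this; `astype(int)` just guarantees an integer result regardless of the pandas version (seeded frame for reproducibility only):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 5, size=(5, 7)),
                  columns=["grey2", "red1", "blue1", "red2", "red3", "blue2", "grey1"])

d = {'grey1': 'color_1', 'grey2': 'color_1',
     'red1': 'color_2', 'red2': 'color_2', 'red3': 'color_2',
     'blue1': 'color_3', 'blue2': 'color_3'}

df.columns = pd.Index(df.columns.map(d), name='color')
melted = df.melt()

# summing booleans counts the strictly positive cells per mapped group
counts = melted.value.gt(0).groupby(melted.color).sum().astype(int)
print(counts)
```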

Upvotes: 1
