Reputation: 1232
I'm trying to group several group of columns to count or sum the rows in a pandas dataframe
I've checked many questions already and the most similar I found is this one > Groupby sum and count on multiple columns in python, but, by what I understand I have to do many steps to reach my goal. and was also looking at this link
As an example, I have the dataframe below:
import numpy as np
df = pd.DataFrame(np.random.randint(0,5,size=(5, 7)), columns=["grey2","red1","blue1","red2","red3","blue2","grey1"])
grey2 red1 blue1 red2 red3 blue2 grey1
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
I want to group here, all the columns by colour, for example, and what I would expect is:
If I sum the numbers,
blue 15
grey 22
red 34
If I count ( x > 0 ) then I will get,
blue 7
grey 10
red 13
this is what I have achieved so far, so now i will have to sum and then create a dataframe with the results, but if I have 100 groups,this would be very time consuming.
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='sum', margins=True)
red1 red2 red3
0 3 2 4
1 2 4 0
2 1 1 1
3 4 4 1
4 4 0 3
ALL 14 11 9
pd.pivot_table(data=df, index=df.index, values=["red1","red2","red3"], aggfunc='count', margins=True)
But here is also counting the zeros:
red1 red2 red3
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
All 5 5 5
Not sure how to alter the function to get my results, and I've already spend hours, hopefully you can help.
NOTE: I only use colours in this example to simplify the case, but I could have around many columns and they are called col001 till col300, etc... So, the groups could be:
blue = col131, col254, col005
red = col023, col190, col053
and so on.....
Upvotes: 4
Views: 544
Reputation: 1130
You could also do something like this for the general case:
colors = {'blue':['blue1','blue2'], 'red':['red1','red2','red3'], 'grey':['grey1','grey2']}
orig_columns = df.columns
df.columns = [key for col in df.columns for key in colors.keys() if col in colors[key]]
print(df.groupby(level=0,axis=1).sum().sum())
df.columns = orig_columns
Upvotes: 0
Reputation: 30920
Use:
df.groupby(df.columns.str.replace('\d+', ''),axis=1).sum().sum()
Output:
blue 15
grey 22
red 34
dtype: int64
this works regardless of the number of digits contained in the name of the columns:
df=df.add_suffix('22')
print(df)
grey22222 red12222 blue12222 red22222 red32222 blue22222 grey12222
0 4 3 0 2 4 0 2
1 4 2 0 4 0 3 1
2 1 1 3 1 1 3 1
3 4 4 1 4 1 1 1
4 3 4 1 0 3 3 1
df.groupby(df.columns.str.replace('\d+', ''),axis=1).sum().sum()
blue 15
grey 22
red 34
dtype: int64
Upvotes: 0
Reputation: 150765
You can use pd.wide_to_long
:
data= pd.wide_to_long(df.reset_index(), stubnames=['grey','red','blue'],
i='index',
j='group',
sep=''
)
Output:
# data
grey red blue
index group
0 1 2.0 3 0.0
2 4.0 2 0.0
3 NaN 4 NaN
1 1 1.0 2 0.0
2 4.0 4 3.0
3 NaN 0 NaN
2 1 1.0 1 3.0
2 1.0 1 3.0
3 NaN 1 NaN
3 1 1.0 4 1.0
2 4.0 4 1.0
3 NaN 1 NaN
4 1 1.0 4 1.0
2 3.0 0 3.0
3 NaN 3 NaN
And:
data.sum()
# grey 22.0
# red 34.0
# blue 15.0
# dtype: float64
data.gt(0).sum()
# grey 10
# red 13
# blue 7
# dtype: int64
Update wide_to_long
is just a convenient shortcut for merge
and rename
. So if you have a dictionary {cat:[col_list]}
, you could resolve to that:
groups = {'blue' : ['col131', 'col254', 'col005'],
'red' : ['col023', 'col190', 'col053']}
# create the inverse dictionary for mapping
inv_group = {v:k for k,v in groups.items()}
data = df.melt()
# map the original columns to group
data['group'] = data['variable'].map(inv_group)
# from now on, it's similar to other answers
# sum
data.groupby('group')['value'].sum()
# count
data['value'].gt(0).groupby(data['group']).sum()
Upvotes: 3
Reputation: 59549
The complication here is that you want to collapse both by rows and columns, which is generally difficult to do at the same time. We can melt
to go from your wide format to a longer format, which then reduces the problem to a single groupby
# Get rid of the numbers + reshape
df.columns = pd.Index(df.columns.str.rstrip('0123456789'), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#blue 15
#grey 22
#red 34
df.value.gt(0).groupby(df.color).sum()
#color
#blue 7.0
#grey 10.0
#red 13.0
#Name: value, dtype: float64
With names that are less simple to group, we'd need to have the mapping somewhere, the steps are very similar:
# Unnecessary in this case, but more general
d = {'grey1': 'color_1', 'grey2': 'color_1',
'red1': 'color_2', 'red2': 'color_2', 'red3': 'color_2',
'blue1': 'color_3', 'blue2': 'color_3'}
df.columns = pd.Index(df.columns.map(d), name='color')
df = df.melt()
df.groupby('color').sum()
# value
#color
#color_1 22
#color_2 34
#color_3 15
Upvotes: 1