Combining cell values based on two columns in pandas dataframe

Question

I'm learning to use python and pandas and I'm wondering how can I combine something from different rows based on the cell values in two columns. In this case, i want to combine scores from same department and within same id.

This is the simplified version of the DF I have:

 id  department  name  scoreA
abc       sales  eric       2
abc       sales  jack       3
abc   marketing sofia       6
abc   marketing  anna       7
zzz       sales  jack       1
zzz       sales  eric       8
zzz   marketing sofia      11
zzz   marketing  anna       1

And this is the DF I want:

 id  department totalScoreA
abc       sales           5
abc   marketing          13
zzz       sales           9
zzz   marketing          12

I have also a follow up question. What if there's two columns for counts and I want an average of those counts, but before averaging those values I want to multiply the scoreB by 2. Like this:

 id  department  name  scoreA  scoreB
abc       sales  eric       2      10
abc       sales  jack       3       6
abc   marketing sofia       6       8
abc   marketing  anna       7      10
zzz       sales  eric       8      10
zzz       sales  jack       2      10
zzz   marketing sofia      11       4
zzz   marketing  anna       1      10

And this is the DF I want:

 id  department totalScoreA  AverageScore((A+B*2)/2)
abc       sales           5                     18.5
abc   marketing          13                     24.5
zzz       sales          10                       25
zzz   marketing          12                       20

Update:

Hey thank you so much for the answers @jezrael! The first one worked as it should!

However I might've been little bit too vague for defining the second question. What I wanted is to get an "combined" mean of all the scoreB*2's ScoreA's for every group of every department. I give an example with values to clarify this:

From this:

 id  department   name  scoreA  scoreB
zzz   marketing  sofia       5       4
zzz   marketing   anna       2

To this:

The meanAB is (5+2+4*2)/3 (the number 3 comes from the count of the values). So how would i calculate that one because i wasn't able to do it, even with the help of your previous solution :/

 id  department  meanA  meanB  meanAB
zzz   marketing    3.5      4       5

jezrael · Accepted Answer

First is aggregation sum with DataFrame.groupby:

df1 = df.groupby(['id', 'department'], as_index=False, sort=False)['scoreA'].sum()
print (df1)
    id department  scoreA
0  abc      sales       5
1  abc  marketing      13
2  zzz      sales      10
3  zzz  marketing      12

And second is first multiple column scoreB, added scoreA and used GroupBy.agg with dictionary of aggregate functions, here sum and mean:

df2 = (df.assign(scoreB = df['scoreB'] * 2 + df['scoreA'])
         .groupby(['id', 'department'], as_index=False, sort=False)
         .agg({'scoreA':'sum', 'scoreB':'mean'}))
print (df2)

    id department  scoreA  scoreB
0  abc      sales       5    18.5
1  abc  marketing      13    24.5
2  zzz      sales      10    25.0
3  zzz  marketing      12    20.0

EDIT:

print (df)
    id department   name  scoreA  scoreB
0  abc      sales   eric       2    10.0
1  abc      sales   jack       3     6.0
2  abc  marketing  sofia       6     8.0
3  abc  marketing   anna       7    10.0
4  abc  marketing   eric       8    10.0 <-changed data
5  zzz      sales   jack       2    10.0
6  zzz  marketing  sofia       5     4.0 <-changed data
7  zzz  marketing   anna       2     NaN <-changed data

Create new column Count by DataFrame.count function for get number of values with explude NaNs, then aggregate sum and divide for mean:

df2 = (df.assign(scoreB = df['scoreB'].mul(2).add(df['scoreA'], fill_value=0), 
                 Count = df[['scoreA','scoreB']].count(1))
         .groupby(['id', 'department'], as_index=False, sort=False)
         .sum())
print (df2)
    id department  scoreA  scoreB  Count
0  abc      sales       5    37.0      4
1  abc  marketing      21    77.0      6
2  zzz      sales       2    22.0      2
3  zzz  marketing       7    15.0      3

df2['scoreB'] /= df2.pop('Count')
print (df2)
    id department  scoreA     scoreB
0  abc      sales       5   9.250000
1  abc  marketing      21  12.833333
2  zzz      sales       2  11.000000
3  zzz  marketing       7   5.000000

Detail:

print (df.assign(scoreB = df['scoreB'].mul(2).add(df['scoreA'], fill_value=0), 
                 Count = df[['scoreA','scoreB']].count(1)))
    id department   name  scoreA  scoreB  Count
0  abc      sales   eric       2    22.0      2
1  abc      sales   jack       3    15.0      2
2  abc  marketing  sofia       6    22.0      2
3  abc  marketing   anna       7    27.0      2
4  abc  marketing   eric       8    28.0      2
5  zzz      sales   jack       2    22.0      2
6  zzz  marketing  sofia       5    13.0      2
7  zzz  marketing   anna       2     2.0      1

Combining cell values based on two columns in pandas dataframe

Update:

Answers (1)

Related Questions