Reputation: 37
I'm learning to use python and pandas and I'm wondering how can I combine something from different rows based on the cell values in two columns. In this case, i want to combine scores from same department and within same id.
This is the simplified version of the DF I have:
id department name scoreA
abc sales eric 2
abc sales jack 3
abc marketing sofia 6
abc marketing anna 7
zzz sales jack 1
zzz sales eric 8
zzz marketing sofia 11
zzz marketing anna 1
And this is the DF I want:
id department totalScoreA
abc sales 5
abc marketing 13
zzz sales 9
zzz marketing 12
I have also a follow up question. What if there's two columns for counts and I want an average of those counts, but before averaging those values I want to multiply the scoreB by 2. Like this:
id department name scoreA scoreB
abc sales eric 2 10
abc sales jack 3 6
abc marketing sofia 6 8
abc marketing anna 7 10
zzz sales eric 8 10
zzz sales jack 2 10
zzz marketing sofia 11 4
zzz marketing anna 1 10
And this is the DF I want:
id department totalScoreA AverageScore((A+B*2)/2)
abc sales 5 18.5
abc marketing 13 24.5
zzz sales 10 25
zzz marketing 12 20
Hey thank you so much for the answers @jezrael! The first one worked as it should!
However I might've been little bit too vague for defining the second question. What I wanted is to get an "combined" mean of all the scoreB*2
's ScoreA
's for every group of every department. I give an example with values to clarify this:
From this:
id department name scoreA scoreB
zzz marketing sofia 5 4
zzz marketing anna 2
To this:
The meanAB
is (5+2+4*2)/3
(the number 3 comes from the count of the values). So how would i calculate that one because i wasn't able to do it, even with the help of your previous solution :/
id department meanA meanB meanAB
zzz marketing 3.5 4 5
Upvotes: 1
Views: 261
Reputation: 862701
First is aggregation sum
with DataFrame.groupby
:
df1 = df.groupby(['id', 'department'], as_index=False, sort=False)['scoreA'].sum()
print (df1)
id department scoreA
0 abc sales 5
1 abc marketing 13
2 zzz sales 10
3 zzz marketing 12
And second is first multiple column scoreB
, added scoreA
and used GroupBy.agg
with dictionary of aggregate functions, here sum
and mean
:
df2 = (df.assign(scoreB = df['scoreB'] * 2 + df['scoreA'])
.groupby(['id', 'department'], as_index=False, sort=False)
.agg({'scoreA':'sum', 'scoreB':'mean'}))
print (df2)
id department scoreA scoreB
0 abc sales 5 18.5
1 abc marketing 13 24.5
2 zzz sales 10 25.0
3 zzz marketing 12 20.0
EDIT:
print (df)
id department name scoreA scoreB
0 abc sales eric 2 10.0
1 abc sales jack 3 6.0
2 abc marketing sofia 6 8.0
3 abc marketing anna 7 10.0
4 abc marketing eric 8 10.0 <-changed data
5 zzz sales jack 2 10.0
6 zzz marketing sofia 5 4.0 <-changed data
7 zzz marketing anna 2 NaN <-changed data
Create new column Count
by DataFrame.count
function for get number of values with explude NaNs, then aggregate sum
and divide for mean
:
df2 = (df.assign(scoreB = df['scoreB'].mul(2).add(df['scoreA'], fill_value=0),
Count = df[['scoreA','scoreB']].count(1))
.groupby(['id', 'department'], as_index=False, sort=False)
.sum())
print (df2)
id department scoreA scoreB Count
0 abc sales 5 37.0 4
1 abc marketing 21 77.0 6
2 zzz sales 2 22.0 2
3 zzz marketing 7 15.0 3
df2['scoreB'] /= df2.pop('Count')
print (df2)
id department scoreA scoreB
0 abc sales 5 9.250000
1 abc marketing 21 12.833333
2 zzz sales 2 11.000000
3 zzz marketing 7 5.000000
Detail:
print (df.assign(scoreB = df['scoreB'].mul(2).add(df['scoreA'], fill_value=0),
Count = df[['scoreA','scoreB']].count(1)))
id department name scoreA scoreB Count
0 abc sales eric 2 22.0 2
1 abc sales jack 3 15.0 2
2 abc marketing sofia 6 22.0 2
3 abc marketing anna 7 27.0 2
4 abc marketing eric 8 28.0 2
5 zzz sales jack 2 22.0 2
6 zzz marketing sofia 5 13.0 2
7 zzz marketing anna 2 2.0 1
Upvotes: 6