Tarandeepsingh
Tarandeepsingh

Reputation: 175

pandas - merge and sum nearly duplicate rows

I have a pandas data frame with several rows that are near-duplicates of each other, except for one value. my goal is to merge those rows and sum the distinct value.

here is an example:

names   count   subject
A       2       physics
A       3       physics
A       3       chemistry
B       2       literature
B       3       literature
B       1       economics
C       3       physics
C       2       chemistry

And here is what I would like:

names   count   subject
A       5       physics
A       3       chemistry
B       5       literature
B       1       economics
C       3       physics
C       2       chemistry

i tried using the groupby() and this is what I got:

df2 = df.groupby(["names","subject"]).agg({"count":"sum"})
names subject       count
A     chemistry       3
      physics         5
B     economics       1
      literature      5
C     chemistry       2
      physics         3

What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.

Upvotes: 2

Views: 3122

Answers (2)

jezrael
jezrael

Reputation: 862751

You can add sort=False and as_index=False for avoid MultiIndex in aggregation of sum:

df2 = df.groupby(["names","subject"], sort=False, as_index=False).agg({"count":"sum"})
print (df2)
  names     subject  count
0     A     physics      5
1     A   chemistry      3
2     B  literature      5
3     B   economics      1
4     C     physics      3
5     C   chemistry      2

If need change order pass list of columns names:

df2 = df2[['names','count','subject']]
print (df2)
  names  count     subject
0     A      5     physics
1     A      3   chemistry
2     B      5  literature
3     B      1   economics
4     C      3     physics
5     C      2   chemistry

Also is possible pass original columns names (if columns passed to groupby and after groupby):

df2 = (df.groupby(["names","subject"], sort=False, as_index=False)["count"]
         .sum()
         .reindex(df.columns, axis=1))
print (df2)
  names  count     subject
0     A      5     physics
1     A      3   chemistry
2     B      5  literature
3     B      1   economics
4     C      3     physics
5     C      2   chemistry

Upvotes: 5

Thomas
Thomas

Reputation: 56

Simply specify as_index=False:

df.groupby(["names","subject"], as_index=False).sum()

Upvotes: 1

Related Questions