Reputation: 175
I have a pandas data frame with several rows that are near-duplicates of each other, except for one value. my goal is to merge those rows and sum the distinct value.
here is an example:
names count subject
A 2 physics
A 3 physics
A 3 chemistry
B 2 literature
B 3 literature
B 1 economics
C 3 physics
C 2 chemistry
And here is what I would like:
names count subject
A 5 physics
A 3 chemistry
B 5 literature
B 1 economics
C 3 physics
C 2 chemistry
i tried using the groupby() and this is what I got:
df2 = df.groupby(["names","subject"]).agg({"count":"sum"})
names subject count
A chemistry 3
physics 5
B economics 1
literature 5
C chemistry 2
physics 3
What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.
Upvotes: 2
Views: 3122
Reputation: 862751
You can add sort=False
and as_index=False
for avoid MultiIndex
in aggregation of sum
:
df2 = df.groupby(["names","subject"], sort=False, as_index=False).agg({"count":"sum"})
print (df2)
names subject count
0 A physics 5
1 A chemistry 3
2 B literature 5
3 B economics 1
4 C physics 3
5 C chemistry 2
If need change order pass list of columns names:
df2 = df2[['names','count','subject']]
print (df2)
names count subject
0 A 5 physics
1 A 3 chemistry
2 B 5 literature
3 B 1 economics
4 C 3 physics
5 C 2 chemistry
Also is possible pass original columns names (if columns passed to groupby
and after groupby
):
df2 = (df.groupby(["names","subject"], sort=False, as_index=False)["count"]
.sum()
.reindex(df.columns, axis=1))
print (df2)
names count subject
0 A 5 physics
1 A 3 chemistry
2 B 5 literature
3 B 1 economics
4 C 3 physics
5 C 2 chemistry
Upvotes: 5
Reputation: 56
Simply specify as_index=False:
df.groupby(["names","subject"], as_index=False).sum()
Upvotes: 1