Reputation: 135
I am new to using Pandas and I am trying to restructure a dataframe to remove the duplicates in my first column, while also keeping the number of each duplicate, and taking the sum of values in the second column.
For example, I would like the conversion to look something like this:
[In]:
+---+------+-------+
| | Name | Value |
+---+------+-------+
| 0 | A | 5 |
| 1 | B | 5 |
| 2 | C | 10 |
| 3 | A | 15 |
| 4 | A | 5 |
| 5 | C | 10 |
+---+------+-------+
[Out]:
+---+------+--------+-------+
| | Name | Number | Total |
+---+------+--------+-------+
| 0 | A | 3 | 25 |
| 1 | B | 1 | 5 |
| 2 | C | 2 | 20 |
+---+------+--------+-------+
So far, I haven't been able to find an efficient method to do this. (Or even a working method.)
I will be working with several hundred thousand rows, so I will need to find a pretty efficient method.
Upvotes: 0
Views: 44
Reputation: 309
The pandas agg function on a groupby is what you want. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
Here is an example:
import pandas as pd
df=pd.DataFrame({'Name':['A','B','C','A','A','C'],
'Value':[5,5,10,15,5,10]})
df.groupby('Name').agg(['count','sum'])
Hope that helps.
Upvotes: 2