Reputation: 105
My DataFrame:
State = ['CA', 'CA', 'CA', 'CA', 'SR', 'SR', 'SR', 'SR']
County = ['UC', 'UC', 'DB', 'DB', 'DD', 'DD', 'DD', 'DD']
Count = [100, 100, 150, 150, 200, 200, 200, 200]
df = pd.DataFrame({'State': State, 'County': County, 'Count' : Count})
Expected output:
State Count
0 CA 250
1 SR 200
This is what I am doing:
df_new = df.groupby(['State']).sum().reset_index()
This is what I am getting:
State Count
0 CA 500
1 SR 800
If County is repeated in State. it should ignore the duplicated entry. But, it is not doing it why?
Upvotes: 1
Views: 426
Reputation: 3968
Another way of doing it would be using lambda functions
df.groupby(['State']).apply(lambda x: x.groupby(['County']).Count.first().sum()).reset_index()
Output:
State 0
0 CA 250
1 SR 200
Hope this helps :)
Upvotes: 1
Reputation: 153460
Try using drop_duplicates
filtering the dataframe first then groupby and sum:
df.drop_duplicates(['State', 'County']).groupby('State').sum().reset_index()
Output:
State Count
0 CA 250
1 SR 200
Upvotes: 2