Rocky
Rocky

Reputation: 105

Pandas groupby sum is giving wrong output

My DataFrame:

State = ['CA', 'CA', 'CA', 'CA', 'SR', 'SR', 'SR', 'SR']
County = ['UC', 'UC', 'DB', 'DB', 'DD', 'DD', 'DD', 'DD']
Count  = [100, 100, 150, 150, 200, 200, 200, 200]
df = pd.DataFrame({'State': State, 'County': County, 'Count' : Count})

Expected output:

 State  Count
0    CA    250
1    SR    200

This is what I am doing:

df_new = df.groupby(['State']).sum().reset_index()

This is what I am getting:

 State  Count
0    CA    500
1    SR    800

If County is repeated in State. it should ignore the duplicated entry. But, it is not doing it why?

Upvotes: 1

Views: 426

Answers (2)

Rishu S
Rishu S

Reputation: 3968

Another way of doing it would be using lambda functions

df.groupby(['State']).apply(lambda x: x.groupby(['County']).Count.first().sum()).reset_index()

Output:

State   0
0   CA  250
1   SR  200

Hope this helps :)

Upvotes: 1

Scott Boston
Scott Boston

Reputation: 153460

Try using drop_duplicates filtering the dataframe first then groupby and sum:

df.drop_duplicates(['State', 'County']).groupby('State').sum().reset_index()

Output:

  State  Count
0    CA    250
1    SR    200

Upvotes: 2

Related Questions