anon_hippo
anon_hippo

Reputation: 25

Aggregating similar rows in Pandas

I've got a dataframe that's currently aggregated by zip code, and looks similar to this:

Year   Organization   State   Zip      Number_of_people

2021   A              NJ      07090    5
2020   B              AZ      09876    3
2021   A              NJ      01234    2
2021   C              VA      23456    7
2019   A              NJ      05385    1

I want to aggregate the dataframe and Number_of_People column by state instead, combining identical rows (aside from Number of people) so that the data above instead looks like this:

Year   Organization   State   Number_of_people

2021   A              NJ      7
2020   B              AZ      3
2021   C              VA      7
2019   A              NJ      1

In other words, if rows are identical in all columns EXCEPT Number_of_people, I want to combine the rows and add the number_of_people.

I'm stuck on how to approach this problem after deleting the Zip column -- I think I need to group by Year, Organization, and State but not sure what to do after that.

Upvotes: 0

Views: 159

Answers (2)

Pawan Jain
Pawan Jain

Reputation: 825

A more pythonic version without zip codes

df.groupby(['Year','Organization','State'], as_index=False)['Number_of_people'].sum()

A more pythonic version with zip codes

df.groupby(['Year','Organization','State'], as_index=False).sum()

Upvotes: 1

Jonathan Leon
Jonathan Leon

Reputation: 5648

You don't have to drop zip first if you don't want, use the syntax below.

data = '''Year   Organization   State   Zip      Number_of_people
2021   A              NJ      07090    5
2020   B              AZ      09876    3
2021   A              NJ      01234    2
2021   C              VA      23456    7
2019   A              NJ      05385    1'''
df = pd.read_csv(io.StringIO(data), sep='\s+', engine='python')
df[['Year','Organization','State', 'Number_of_people']].groupby(['Year','Organization','State']).sum().reset_index()

Output

   Year Organization State  Number_of_people
0  2019            A    NJ                 1
1  2020            B    AZ                 3
2  2021            A    NJ                 7
3  2021            C    VA                 7

If you do want to drop the zip code, then use this:

df.groupby(['Year','Organization','State']).sum().reset_index()

Upvotes: 1

Related Questions