Reputation: 25
I've got a dataframe that's currently aggregated by zip code, and looks similar to this:
Year Organization State Zip Number_of_people
2021 A NJ 07090 5
2020 B AZ 09876 3
2021 A NJ 01234 2
2021 C VA 23456 7
2019 A NJ 05385 1
I want to aggregate the dataframe and Number_of_People
column by state instead, combining identical rows (aside from Number of people) so that the data above instead looks like this:
Year Organization State Number_of_people
2021 A NJ 7
2020 B AZ 3
2021 C VA 7
2019 A NJ 1
In other words, if rows are identical in all columns EXCEPT Number_of_people
, I want to combine the rows and add the number_of_people
.
I'm stuck on how to approach this problem after deleting the Zip column -- I think I need to group by Year, Organization, and State but not sure what to do after that.
Upvotes: 0
Views: 159
Reputation: 825
A more pythonic version without zip codes
df.groupby(['Year','Organization','State'], as_index=False)['Number_of_people'].sum()
A more pythonic version with zip codes
df.groupby(['Year','Organization','State'], as_index=False).sum()
Upvotes: 1
Reputation: 5648
You don't have to drop zip first if you don't want, use the syntax below.
data = '''Year Organization State Zip Number_of_people
2021 A NJ 07090 5
2020 B AZ 09876 3
2021 A NJ 01234 2
2021 C VA 23456 7
2019 A NJ 05385 1'''
df = pd.read_csv(io.StringIO(data), sep='\s+', engine='python')
df[['Year','Organization','State', 'Number_of_people']].groupby(['Year','Organization','State']).sum().reset_index()
Output
Year Organization State Number_of_people
0 2019 A NJ 1
1 2020 B AZ 3
2 2021 A NJ 7
3 2021 C VA 7
If you do want to drop the zip code, then use this:
df.groupby(['Year','Organization','State']).sum().reset_index()
Upvotes: 1