Reputation: 91
I have a df with a race column, which has 4 categories. However, I would like to only have three categories by combining the last two categories. This is what my current df looks like:
Year |State| Race |Sex | population
-----|-----|------|----|----------
2006 | CA | 1 |1 | 504893
2006 | CA | 1 |2 | 783760
2006 | CA | 2 |1 | 800062
2006 | CA | 2 |2 | 768300
2006 | CA | 3 |1 | 347827
2006 | CA | 3 |2 | 344672
2006 | CA | 4 |1 | 565344
2006 | CA | 4 |2 | 356779
I want to consolidate the race==3 and race ==4 into one value (which would be race ==3). So my new df output would look something like this:
Year |State| Race |Sex | population
-----|-----|------|----|----------
2006 | CA | 1 |1 | 504893
2006 | CA | 1 |2 | 783760
2006 | CA | 2 |1 | 800062
2006 | CA | 2 |2 | 768300
2006 | CA | 3 |1 | 913171
2006 | CA | 3 |2 | 701451
Notice, the last two rows in this df are the sum of the the last 4 rows on the last df (by sex). How can I accomplish this?
Upvotes: 0
Views: 54
Reputation: 38415
Replace Race 4 by 3 and group data by Race + Sex
df.loc[df['Race']==4, 'Race']=3
df = df.groupby(['Race','Sex'],as_index=False)['population'].sum()
You get
Year State Race Sex population
2006 CA 1 1 504893
2006 CA 1 2 783760
2006 CA 2 1 800062
2006 CA 2 2 768300
2006 CA 3 1 913171
2006 CA 3 2 701451
Upvotes: 1
Reputation: 4313
Example code to show you how to do it, for simplify I just using the last two row of your dataset:
import pandas as pd
df = pd.DataFrame({'Race': [3, 3, 4, 4], 'Sex': [1, 2, 1, 2], 'population': [347827, 344672, 565344, 356779]})
for i in [1, 2]:
df.loc[(df['Sex'] == i) & (df['Race'] == 3), 'population'] += df.loc[(df['Sex'] == i) & (df['Race'] == 4), 'population'].item()
df.drop(df.loc[df['Race'] == 4].index, inplace=True)
print(df)
Outputs:
Race Sex population
0 3 1 913171
1 3 2 701451
Upvotes: 1