Reputation: 101
A dataframe looks like this:
df = pd.DataFrame({'name':["a"," b", "c","d", "e","a"," a", "a"," b", "c","d", "e","a"," a"],
'gender': ["male", "female", "female", "female", "male","male","male","female","female",
"female", "male","male","male", "male"],
'year':[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2019],
'month':[1, 12, 4, 3, 6, 7, 2, 4, 5, 1, 12, 4, 3, 6 ],
'count':[100, 30, 10, 90, 34, 100, 30, 10, 90, 34, 100, 30, 10, 90]})
It shows the name, gender, birth year and the birth month, and the number of people.
For example, in 2005 January there were 100 babies named “a”. I want to find top-10 frequent names for males and females, just as below:
df.sort_values(['gender','count'],ascending=False,inplace=True)
male = df[df['gender']=='male']['name'].head(10).to_list()
female = df[df['gender']=='female']['name'].head(10).to_list()
results = pd.DataFrame({'Male':pd.Series(male),'Female':pd.Series(female)})
print (results)
but apparently it gives duplicated amounts for example 5 a instead of total amount for a.I need to sum them up for example we have 100 'a' in 2005 and 100 'a' in 2009 and so on. so in total we need to add them up like 100+ 100 and.the final result must be like a is male and totally in 10 years 500 people were named a and so on.
Upvotes: 0
Views: 71
Reputation: 862406
First aggregate sum
for count
column, then select male
and female
with Series.nlargest
and return index values for names:
s = df.groupby(['gender','name'])['count'].sum()
results = pd.DataFrame({'Male':pd.Series(s.loc['male'].nlargest(10).index),
'Female':pd.Series(s.loc['female'].nlargest(10).index)})
print (results)
Male Female
0 a b
1 d d
2 e c
3 NaN a
Upvotes: 2
Reputation: 328
I think you are looking for a multiple groupby:
by_year = df.groupby(['year', 'gender', 'name']).agg({'count': 'sum'})
print(by_year)
total = by_year.groupby(['gender', 'name']).agg({'count': 'sum'})
print(total)
If not, please further specify an expected result.
Upvotes: 0