nemo92world
nemo92world

Reputation: 101

group one column according to another column and sum of third column

A dataframe looks like this:

df = pd.DataFrame({'name':["a"," b", "c","d", "e","a"," a", "a"," b", "c","d", "e","a"," a"],
       'gender': ["male", "female", "female", "female", "male","male","male","female","female", 
                  "female", "male","male","male", "male"],
      'year':[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2019],
      'month':[1, 12, 4, 3, 6, 7, 2, 4, 5, 1, 12, 4, 3, 6 ],
      'count':[100, 30, 10, 90, 34, 100, 30, 10, 90, 34, 100, 30, 10, 90]})

It shows the name, gender, birth year and the birth month, and the number of people.

For example, in 2005 January there were 100 babies named “a”. I want to find top-10 frequent names for males and females, just as below:

enter image description here

df.sort_values(['gender','count'],ascending=False,inplace=True)
male = df[df['gender']=='male']['name'].head(10).to_list()
female = df[df['gender']=='female']['name'].head(10).to_list()
results = pd.DataFrame({'Male':pd.Series(male),'Female':pd.Series(female)})
print (results)

but apparently it gives duplicated amounts for example 5 a instead of total amount for a.I need to sum them up for example we have 100 'a' in 2005 and 100 'a' in 2009 and so on. so in total we need to add them up like 100+ 100 and.the final result must be like a is male and totally in 10 years 500 people were named a and so on.

Upvotes: 0

Views: 71

Answers (2)

jezrael
jezrael

Reputation: 862406

First aggregate sum for count column, then select male and female with Series.nlargest and return index values for names:

s = df.groupby(['gender','name'])['count'].sum()

results = pd.DataFrame({'Male':pd.Series(s.loc['male'].nlargest(10).index),
                        'Female':pd.Series(s.loc['female'].nlargest(10).index)})
print (results)

  Male Female
0    a      b
1    d      d
2    e      c
3  NaN      a

Upvotes: 2

KaRaOkEy
KaRaOkEy

Reputation: 328

I think you are looking for a multiple groupby:

by_year = df.groupby(['year', 'gender', 'name']).agg({'count': 'sum'})
print(by_year)

total = by_year.groupby(['gender', 'name']).agg({'count': 'sum'})
print(total)

If not, please further specify an expected result.

Upvotes: 0

Related Questions