eager_learner
eager_learner

Reputation: 154

Pandas + groupby

The dataset contains 4 columns, where name is the name of the child, yearofbirth denotes the year in which the child was born, number represents the number of babies who were named with that particular name.

   For example, entry 1 reads, in the year 1880, 7065 girl children were named Mary.

HEAD of dataset

Through pandas, I`m trying to find out every year which name was the most used one. My code

   df.groupby(['yearofbirth']).agg({'number':'max'}).reset_index()

The above code partially answers the question in hand.

Result of query

I want to want the name along with the maximum number.

Upvotes: 8

Views: 248

Answers (3)

Praveen
Praveen

Reputation: 9365

You can try this; sort and get the last val

df.sort_values('number').groupby('yearofbirth').tail(1)

or you can use reindex

df.reindex(df.groupby('yearofbirth').number.idxmax())

demo:

In [1]: df = pd.DataFrame({'name':list('abcaac'),
    ...:                    'yearofbirth':[1800,1800,1801,1801,1802,1802],
    ...:                    'number':[7,8,9,4,2,3],
    ...:                    'sex':['F'] * 6,
    ...: })

In [2]: df.sort_values('number').groupby('yearofbirth').tail(1)
Out[2]:
  name  yearofbirth  number sex
5    c         1802       3   F
1    b         1800       8   F
2    c         1801       9   F


In [3]: df.reindex(df.groupby('yearofbirth').number.idxmax())
Out[3]:
       name  yearofbirth  number sex
number
1         b         1800       8   F
2         c         1801       9   F
5         c         1802       3   F

In [4]: df.loc[df.groupby('yearofbirth').number.idxmax()]
Out[4]:
  name  yearofbirth  number sex
1    b         1800       8   F
2    c         1801       9   F
5    c         1802       3   F

Upvotes: 2

jezrael
jezrael

Reputation: 863751

I think need if each year have only one maximum value - sort_values with drop_duplicates:

df = pd.DataFrame({'name':list('abcaac'),
                   'yearofbirth':[1800,1800,1801,1801,1802,1802],
                   'number':[7,8,9,4,2,3],
                   'sex':['F'] * 6,
})

print (df)
  name  yearofbirth  number sex
0    a         1800       7   F
1    b         1800       8   F
2    c         1801       9   F
3    a         1801       4   F
4    a         1802       2   F
5    c         1802       3   F

df1 = (df.sort_values(['yearofbirth', 'number'], ascending=[True, False])
         .drop_duplicates('yearofbirth'))
print (df1)
  name  yearofbirth  number sex
1    b         1800       8   F
2    c         1801       9   F
5    c         1802       3   F

If posssible multiple max values per year use @Teoretic solution.

Upvotes: 2

Teoretic
Teoretic

Reputation: 2543

Based on answers from this question I came up with this solution:

idx = df.groupby(['yearofbirth'])['number'].transform(max) == df['number']
df = df[idx]

print(df)

    name    number  sex yearofbirth
0   Mary    7065    F   1880

Upvotes: 4

Related Questions