Reputation: 11

Python Pandas: getting the rows with highest value

Hello! I have a dataframe with the columns year (1910 to 2014), name, and count (the number of occurrences of each name). I want to create a new dataframe that shows the name with the highest count for each year, and I'm not entirely sure how to do this. Thanks!

Upvotes: 0

Answers (4)

anky

Reputation: 75150

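Another way of doing this is to sort by Count in descending order and then drop duplicates on the Year column, which keeps the first (highest-count) row for each year. It was also slightly faster in my timings: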

df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
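
For readers who want to see the one-liner end to end, here is a minimal sketch on a small made-up frame. The sample values and the variable name top are assumptions for illustration; the column names follow this answer's capitalised spelling.

```python
import pandas as pd

# Hypothetical sample data; not the asker's real dataset.
df = pd.DataFrame({
    "Year": [1910, 1910, 1911, 1911, 1911],
    "Name": ["Mary", "Virginia", "Elizabeth", "Mary", "Ann"],
    "Count": [848, 270, 254, 360, 451],
})

# Sort so the largest Count comes first, then keep the first row seen for each Year.
top = (
    df.sort_values("Count", ascending=False)
      .drop_duplicates(["Year"])
      .sort_values("Year")  # optional: put the result back in chronological order
)
print(top)
```

If two names tie for the highest count in a year, only one of them survives the de-duplication.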

Timing results are below. You can try each method on your own data, see how much time each takes, and choose accordingly:

%timeit df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
result: 917 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df.groupby('Year')['Count'].agg('idxmax')]
result: 1.06 ms ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df.groupby('Year')['Count'].idxmax(), :]
result: 1.13 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
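
The %timeit lines above need IPython or Jupyter. As a rough sketch of how a similar comparison might be run in a plain script, the standard-library timeit module can be used. The synthetic data and the helper names by_sort and by_idxmax are assumptions for illustration; timings depend on data size and machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption, not the asker's dataset): 105 years x 200 names.
rng = np.random.default_rng(0)
years = np.repeat(np.arange(1910, 2015), 200)
df = pd.DataFrame({
    "Year": years,
    "Name": [f"name{i % 200}" for i in range(len(years))],
    "Count": rng.permutation(len(years)),  # distinct counts, so no ties
})

def by_sort(frame):
    # Sort descending by Count, keep the first row per Year.
    return frame.sort_values("Count", ascending=False).drop_duplicates(["Year"])

def by_idxmax(frame):
    # Look up the index label of the largest Count in each Year.
    return frame.loc[frame.groupby("Year")["Count"].idxmax()]

# Both approaches should select the same rows when there are no ties.
same_rows = sorted(by_sort(df).index) == sorted(by_idxmax(df).index)

# Average seconds per call over 20 runs of each function.
timings = {
    fn.__name__: timeit.timeit(lambda: fn(df), number=20) / 20
    for fn in (by_sort, by_idxmax)
}
print(same_rows, timings)
```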

Upvotes: 0

Mikhail Stepanov

Reputation: 3800

I suppose groupby and apply is a good approach:

import pandas as pd

df = pd.DataFrame({
    'Year': ['1910', '1910', '1911', '1911', '1911', '2014', '2014'],
    'Name': ['Mary', 'Virginia', 'Elizabeth', 'Mary', 'Ann', 'Virginia', 'Elizabeth'],
    'Count': [848, 270, 254, 360, 451, 81, 380]
})

df
Out:
   Year       Name  Count
0  1910       Mary    848
1  1910   Virginia    270
2  1911  Elizabeth    254
3  1911       Mary    360
4  1911        Ann    451
5  2014   Virginia     81
6  2014  Elizabeth    380

df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
Out:
        Year       Name  Count
Year                          
1910 0  1910       Mary    848
1911 4  1911        Ann    451
2014 6  2014  Elizabeth    380

You can also replace head(1) with head(n) to get the n most frequent names per year:

df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(2))
Out: 
        Year       Name  Count
Year                          
1910 0  1910       Mary    848
     1  1910   Virginia    270
1911 4  1911        Ann    451
     3  1911       Mary    360
2014 6  2014  Elizabeth    380
     5  2014   Virginia     81

If you don't want the extra Year index level, drop it with .reset_index(level=0, drop=True):

top_names = df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
top_names.reset_index(level=0, drop=True)

Out: 
   Year       Name  Count
0  1910       Mary    848
4  1911        Ann    451
6  2014  Elizabeth    380
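
As a possible alternative to the apply calls above, the frame can be sorted once and then passed through groupby(...).head(n). This is a different technique from the one in this answer, sketched here on the same sample data; the names n and top_n are assumptions for illustration. It keeps the original flat index, so no reset_index step is needed.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": ["1910", "1910", "1911", "1911", "1911", "2014", "2014"],
    "Name": ["Mary", "Virginia", "Elizabeth", "Mary", "Ann", "Virginia", "Elizabeth"],
    "Count": [848, 270, 254, 360, 451, 81, 380],
})

n = 2  # how many names to keep per year

# Order rows by Year, then by Count from largest to smallest,
# and take the first n rows of each Year group.
top_n = (
    df.sort_values(["Year", "Count"], ascending=[True, False])
      .groupby("Year")
      .head(n)
)
print(top_n)
```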

Upvotes: 0

Rohit Raj

Reputation: 339

A vectorized way:

group = df.groupby('year')
df.loc[group['count'].agg('idxmax')]
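
To make the two lines above easier to try, here is a minimal self-contained sketch. The sample values and the names best_idx and top_per_year are assumptions for illustration, and it assumes the frame has a unique index so that each label picks out exactly one row.

```python
import pandas as pd

# Hypothetical sample data using the question's lowercase column names.
df = pd.DataFrame({
    "year": [1910, 1910, 1910, 1920, 1920, 1920],
    "name": ["Virginia", "Mary", "Elizabeth", "Virginia", "Mary", "Elizabeth"],
    "count": [848, 420, 747, 1048, 221, 147],
})

# For each year, idxmax gives the index label of the row with the largest count.
best_idx = df.groupby("year")["count"].idxmax()

# Use those labels to pull the full rows out of the original frame.
top_per_year = df.loc[best_idx].reset_index(drop=True)
print(top_per_year)
```

When several rows in a year share the maximum count, idxmax reports only the first of them, so ties are not all returned.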

Upvotes: 1

jerpint

Reputation: 409

Try this:

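import pandas as pd

d = {'year': [1910, 1910, 1910, 1920, 1920, 1920], 'name': ["Virginia", "Mary", "Elizabeth", "Virginia", "Mary", "Elizabeth"], 'count': [848, 420, 747, 1048, 221, 147]}

df = pd.DataFrame(data=d)
top_rows = []  # collects one row (the highest count) per year

years = pd.unique(df['year'])

for year in years:
    tmp_df = df.loc[df['year'] == year]
    # Sort by count, largest first, so the top name is in the first row
    tmp_df = tmp_df.sort_values(by='count', ascending=False)

    top_rows.append(tmp_df.iloc[0])

# DataFrame.append was removed in pandas 2.0, so build the result in one step
df_results = pd.DataFrame(top_rows)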

Upvotes: 0
