Reputation: 11
Hello! I have a dataframe with the columns year (1910 to 2014), name, and count (the number of occurrences of each name). I want to create a new dataframe that shows the name with the highest count for each year, and I'm not sure how to do this. Thanks!
Upvotes: 0
Views: 203
Reputation: 75080
Another way to do this is to sort by Count in descending order and then drop duplicate values in the Year column, which keeps only the top row for each year (it is faster too):
df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
Timing results are below. You can time each method on your own data and choose accordingly:
%timeit df.sort_values('Count', ascending=False).drop_duplicates(['Year'])
result: 917 µs ± 13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df.groupby('Year')['Count'].agg('idxmax')]
result: 1.06 ms ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df.groupby('Year')['Count'].idxmax(), :]
result: 1.13 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
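To make the comparison concrete, here is a minimal sketch on a small made-up frame (the sample values are assumptions, not the asker's data). It only checks that the sort-and-deduplicate approach and the idxmax approach pick the same rows here.

```python
import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({
    'Year': [1910, 1910, 1911, 1911, 1911],
    'Name': ['Mary', 'Virginia', 'Elizabeth', 'Mary', 'Ann'],
    'Count': [848, 270, 254, 360, 451],
})

# Approach 1: sort by Count (highest first), keep the first row seen per Year,
# then put the years back in ascending order
by_sort = (
    df.sort_values('Count', ascending=False)
      .drop_duplicates(['Year'])
      .sort_values('Year')
)

# Approach 2: look up the row label of the maximum Count within each Year
by_idxmax = df.loc[df.groupby('Year')['Count'].idxmax()]

print(by_sort)
print(by_idxmax)
```

If two names tie for the top count in a year, both approaches keep only one of them, and which one they keep may differ.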
Upvotes: 0
Reputation: 3790
I think groupby combined with apply is a good approach:
import pandas as pd

df = pd.DataFrame({
    'Year': ['1910', '1910', '1911', '1911', '1911', '2014', '2014'],
    'Name': ['Mary', 'Virginia', 'Elizabeth', 'Mary', 'Ann', 'Virginia', 'Elizabeth'],
    'Count': [848, 270, 254, 360, 451, 81, 380]
})
df
Out:
Year Name Count
0 1910 Mary 848
1 1910 Virginia 270
2 1911 Elizabeth 254
3 1911 Mary 360
4 1911 Ann 451
5 2014 Virginia 81
6 2014 Elizabeth 380
df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
Out:
        Year       Name  Count
Year
1910 0  1910       Mary    848
1911 4  1911        Ann    451
2014 6  2014  Elizabeth    380
You can also replace head(1) with head(n) to get the n most frequent names per year:
df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(2))
Out:
        Year       Name  Count
Year
1910 0  1910       Mary    848
     1  1910   Virginia    270
1911 4  1911        Ann    451
     3  1911       Mary    360
2014 6  2014  Elizabeth    380
     5  2014   Virginia     81
If you don't want the extra index level, drop it with .reset_index(level=0, drop=True):
top_names = df.groupby(['Year']).apply(lambda x: x.sort_values('Count', ascending=False).head(1))
top_names.reset_index(level=0, drop=True)
Out:
Year Name Count
0 1910 Mary 848
4 1911 Ann 451
6 2014 Elizabeth 380
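As a possible alternative to apply (a different technique from the one above, offered only as a sketch), you can sort once and then call head(n) on the groupby object. This keeps the original row labels, so no extra index level appears. The variable names n and top_n are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': ['1910', '1910', '1911', '1911', '1911', '2014', '2014'],
    'Name': ['Mary', 'Virginia', 'Elizabeth', 'Mary', 'Ann', 'Virginia', 'Elizabeth'],
    'Count': [848, 270, 254, 360, 451, 81, 380],
})

n = 2  # how many names to keep per year (illustrative value)

# Sort by Year ascending and Count descending, then take the first n rows
# of each Year group; the original row labels are preserved
top_n = (
    df.sort_values(['Year', 'Count'], ascending=[True, False])
      .groupby('Year')
      .head(n)
)

print(top_n)
```

Setting n = 1 gives one row per year, the same rows as the head(1) version above.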
Upvotes: 0
Reputation: 339
A vectorized way is to use idxmax, which gives the row label of the largest count within each year:
group = df.groupby('year')
df.loc[group['count'].agg('idxmax')]
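Here is a small worked sketch of those two lines, using made-up data with the lowercase column names from the question. The names top_idx and top_names are illustrative and not part of the original answer.

```python
import pandas as pd

# Made-up sample data, for illustration only
df = pd.DataFrame({
    'year': [1910, 1910, 1910, 1920, 1920, 1920],
    'name': ['Virginia', 'Mary', 'Elizabeth', 'Virginia', 'Mary', 'Elizabeth'],
    'count': [848, 420, 747, 1048, 221, 147],
})

group = df.groupby('year')

# Series indexed by year; each value is the row label of that year's max count
top_idx = group['count'].agg('idxmax')

# Select those rows and renumber them to get a clean new dataframe
top_names = df.loc[top_idx].reset_index(drop=True)

print(top_idx)
print(top_names)
```

The intermediate top_idx holds one row label per year, which is why passing it to df.loc returns exactly one row for each year.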
Upvotes: 1
Reputation: 409
Try this:
import pandas as pd

d = {'year': [1910, 1910, 1910, 1920, 1920, 1920], 'name': ["Virginia", "Mary", "Elizabeth", "Virginia", "Mary", "Elizabeth"], 'count': [848, 420, 747, 1048, 221, 147]}
df = pd.DataFrame(data=d)
rows = []  # collects the top row for each year
years = pd.unique(df['year'])
for year in years:
    tmp_df = df.loc[df['year'] == year]
    # Sort by count, highest first, so the first row is that year's top name
    tmp_df = tmp_df.sort_values(by='count', ascending=False)
    rows.append(tmp_df.iloc[0])
# DataFrame.append was removed in pandas 2.0, so build the result in one step
df_results = pd.DataFrame(rows)
Upvotes: 0