iraciv94
iraciv94

Reputation: 840

Select top n columns based on another column

I have a database as the following:

enter image description here

And I would like to obtain a pandas dataframe filtered for the 2 rows per date, based on the top ones that have the highest population. The output should look like this:

enter image description here

I know that pandas offers a formula called nlargest: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nlargest.html

but I don't think it is usable for this use case. Is there any workaround?

Thanks so much in advance!

Upvotes: 1

Views: 1212

Answers (1)

Karn Kumar
Karn Kumar

Reputation: 8826

I have mimicked your dataframe as below and provided a way forward to get the desired result.

Your Dataframe:

>>> df
        Date country  population
0 2019-12-31       A         100
1 2019-12-31       B          10
2 2019-12-31       C        1000
3 2020-01-01       A         200
4 2020-01-01       B          20
5 2020-01-01       C        3500
6 2020-01-01       D          12
7 2020-02-01       D        2000
8 2020-02-01       E          54

Your Desired Solution:

You can use nlargest method along with set_index ans groupby method.

This is what you will get..

>>> df.set_index('country').groupby('Date')['population'].nlargest(2)
Date        country
2019-12-31  C          1000
            A           100
2020-01-01  C          3500
            A           200
2020-02-01  D          2000
            E            54
Name: population, dtype: int64

Now, as you want the DataFrame into original state by resetting the index of the DataFrame, which will give you following ..

>>> df.set_index('country').groupby('Date')['population'].nlargest(2).reset_index()
        Date country  population
0 2019-12-31       C        1000
1 2019-12-31       A         100
2 2020-01-01       C        3500
3 2020-01-01       A         200
4 2020-02-01       D        2000
5 2020-02-01       E          54

Another way around:

With groupby and apply function use reset_index with parameter drop=True and level= ..

>>> df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=[0,1], drop=True)
  # df.groupby('Date').apply(lambda p: p.nlargest(2, columns='population')).reset_index(level=['Date',1], drop=True)
        Date country  population
0 2019-12-31       C        1000
1 2019-12-31       A         100
2 2020-01-01       C        3500
3 2020-01-01       A         200
4 2020-02-01       D        2000
5 2020-02-01       E          54

Upvotes: 2

Related Questions