Ashwini Desai
Ashwini Desai

Reputation: 11

How to use Groupby funtion to find highest value in a dataframe

I have the following dataset.I want to find quarter of which year has generated the highest number of install for each app used in the study?

        Installs          CR     Month  Year             Category
0         10000    Everyone  January   2018       ART_AND_DESIGN
1        500000    Everyone  January   2018       ART_AND_DESIGN
2       5000000    Everyone   August   2018       ART_AND_DESIGN
3      50000000        Teen     June   2018       ART_AND_DESIGN
4        100000    Everyone     June   2018       ART_AND_DESIGN
        ...         ...       ...   ...                  ...
10836      5000    Everyone     July   2017               FAMILY
10837       100    Everyone     July   2018               FAMILY
10838      1000    Everyone  January   2017              MEDICAL
10839      1000  Mature 17+  January   2015  BOOKS_AND_REFERENCE
10840  10000000    Everyone     July   2018            LIFESTYLE

Upvotes: 1

Views: 43

Answers (1)

jezrael
jezrael

Reputation: 862901

If need max values per quarters and Category use:

q = (pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%B%Y')
       .dt.to_period('Q').rename('Quarter'))

df = df.groupby([q,'Category'])['Installs'].max().reset_index()
print (df)
  Quarter             Category  Installs
0  2015Q1  BOOKS_AND_REFERENCE      1000
1  2017Q1              MEDICAL      1000
2  2017Q3               FAMILY      5000
3  2018Q1       ART_AND_DESIGN    500000
4  2018Q2       ART_AND_DESIGN  50000000
5  2018Q3       ART_AND_DESIGN   5000000
6  2018Q3               FAMILY       100

Or if need aggregate Installs per quarters and categories and get queraters of maximum Installs then use:

q = (pd.to_datetime(df['Month'] + df['Year'].astype(str), format='%B%Y')
       .dt.to_period('Q').rename('Quarter'))

df1 = df.groupby([q,'Category'])['Installs'].sum().reset_index()
print (df1)
  Quarter             Category  Installs
0  2015Q1  BOOKS_AND_REFERENCE      1000
1  2017Q1              MEDICAL      1000
2  2017Q3               FAMILY      5000
3  2018Q1       ART_AND_DESIGN    510000
4  2018Q2       ART_AND_DESIGN  50100000
5  2018Q3       ART_AND_DESIGN   5000000
6  2018Q3               FAMILY       100
7  2018Q3            LIFESTYLE  10000000

df2 = df1.loc[df1.groupby('Category')['Installs'].idxmax()]
print (df2)
  Quarter             Category  Installs
4  2018Q2       ART_AND_DESIGN  50100000
0  2015Q1  BOOKS_AND_REFERENCE      1000
2  2017Q3               FAMILY      5000
7  2018Q3            LIFESTYLE  10000000
1  2017Q1              MEDICAL      1000

Upvotes: 2

Related Questions