Srinidhi Patil
Srinidhi Patil

Reputation: 77

Find top 3 in columns of a dataframe using pandas

I have a time series dataset which looks like this:

Date        Newspaper   City1    City2   Region1Total   City3   City4  Region2Total
2017-12-01  NewsPaper1  231563   8696    240259         21072   8998   30070
2017-12-01  NewsPaper2  173009   12180   185189         28910   5550   34460
2017-12-01  NewsPaper3  40511    4600    45111          5040    3330   8370
2017-12-01  NewsPaper4  37770    2980    40750          6520    1880   8400
2017-12-01  NewsPaper5  5176     900     6076           1790    5000   6790
2017-12-01  NewsPaper6  137650   8025    145675         25300  11000   36300
2017-12-01  Total       637547   38201   675748         91032  36558   127590

2018-01-01  NewsPaper1  231295   8391    239686         8790   21176   29966
2018-01-01  NewsPaper2  169937   12130   182067         7890   28850   36740
2018-01-01  NewsPaper3  40453    4570    45023          4750   5055    9800
2018-01-01  NewsPaper4  37766    2970    40736          2500   6540    9040
2018-01-01  NewsPaper5  5136     900     6036           5600   1795    7365
2018-01-01  NewsPaper6  137990   8010    146000         14500  25330   39830
2018-01-01  Total       633919   37786   671705         44980  91141   136121 

I am trying to find the max n values in each column of this dataframe. I tried the following method

somelist = []
data = pd.read_excel('newspaper.csv')
data.index = pd.to_datetime(data['Date'], errors='coerce')
last_month = data.loc[data.index[-1]] # i am considering only the previous month(latest month in the dataframe)
last_month.set_index('Newspaper', inplace = True)
for city in last_month.iloc[:, 2: ]:
    top_3 = last_month[city].nlargest(4)[1: ] #highest will be total but we should skip it
    somelist.append(top_3)
print(somelist)

This produces the result as pandas series with the name of the column mentioned below as:

    [Newspaper
    Newspaper1    231295
    Newspaper2    169937
    Newspaper6    137990
    Name: City1, dtype: float64, Newspaper
    Newspaper2    12130.0
    Newspaper1     8391.0
    Newspaper6     8010.0
    Name: City2, dtype: float64, Newspaper
    Newspaper1    240259
    Newspaper2    185189
    Newspaper6    145675
    Name: Region1Total, dtype: float64, Newspaper
    Newspaper6    14500.0
    Newspaper1     8790.0
    Newspaper2     7890.0
    Name: City3, dtype: float64, Newspaper
    Newspaper2    28850.0
    Newspaper6    25330.0
    Newspaper1    21176.0
    Name: City4, dtype: float64, Newspaper
    Newspaper6    36300
    Newspaper2    34460
    Newspaper1    34460
    Name: Region2Total, dtype: float64, Newspaper]

What I want is the the top 3 selling newspapers in each city and region along with the sales numbers arranged in the descending order. I also want the name of the city/region to be printed before showing the top 3 result.

Expected output is a list or a series like below:

Newspaper     City1
Newspaper1    231295
Newspaper2    169937
Newspaper6    137990

Newspaper     City2
Newspaper2    12130.0
Newspaper1     8391.0
Newspaper6     8010.0

Newspaper     Region1Total
Newspaper1    240259
Newspaper2    185189
Newspaper6    145675

Newspaper     City3
Newspaper6    14500.0
Newspaper1     8790.0
Newspaper2     7890.0

Newspaper     City4
Newspaper2    28850.0
Newspaper6    25330.0
Newspaper1    21176.0

Newspaper     Region2Total
Newspaper6    36300
Newspaper2    34460
Newspaper1    34460

Also, if i want to skip the regions and just consider the cities then how can i do it for the same? Any help would be appreciated. Thank you in advance.

Upvotes: 3

Views: 3633

Answers (2)

Dillon
Dillon

Reputation: 999

import pandas as pd

# Setup the data
data = pd.DataFrame({'Date': {0: '2017-12-01',
  1: '2017-12-01',
  2: '2017-12-01',
  3: '2017-12-01',
  4: '2017-12-01',
  5: '2017-12-01'},
 'Newspaper': {0: 'NewsPaper1',
  1: 'NewsPaper2',
  2: 'NewsPaper3',
  3: 'NewsPaper4',
  4: 'NewsPaper5',
  5: 'NewsPaper6'},
 'City1': {0: 231563, 1: 173009, 2: 40511, 3: 37770, 4: 5176, 5: 137650},
 'City2': {0: 8696, 1: 12180, 2: 4600, 3: 2980, 4: 900, 5: 8025},
 'Region1Total': {0: 240259,
  1: 185189,
  2: 45111,
  3: 40750,
  4: 6076,
  5: 145675},
 'City3': {0: 21072, 1: 28910, 2: 5040, 3: 6520, 4: 1790, 5: 25300},
 'City4': {0: 8998, 1: 5550, 2: 3330, 3: 1880, 4: 5000, 5: 11000},
 'Region2Total': {0: 30070, 1: 34460, 2: 8370, 3: 8400, 4: 6790, 5: 36300}}
)

# Not all columns are required, only the Newspaper and any 'City' column
cleaned_data = data[[i for i in data.columns if 'City' in i] + ['Newspaper']]

# Change the structure
df = cleaned_data.set_index('Newspaper').unstack()

# Get the top 3 values for each city
df = df.groupby(level=0).apply(lambda df: df.sort_values(ascending=False)[:4])
df.index = df.index.droplevel(0)


df
Out[]:
       Newspaper 
City1  NewsPaper1    231563
       NewsPaper2    173009
       NewsPaper6    137650
       NewsPaper3     40511
City2  NewsPaper2     12180
       NewsPaper1      8696
       NewsPaper6      8025
       NewsPaper3      4600
City3  NewsPaper2     28910
       NewsPaper6     25300
       NewsPaper1     21072
       NewsPaper4      6520
City4  NewsPaper6     11000
       NewsPaper1      8998
       NewsPaper2      5550
       NewsPaper5      5000

Upvotes: 0

harpan
harpan

Reputation: 8641

First of all, you need to get a dataframe where only Newspapers are listed, not total.

dff = df.loc[df['Newspaper']!='Total']

Then for city1, you can do:

dff[['Newspaper', 'City1']].sort_values(['City1'], ascending=False).head(3)

Output:

     Newspaper  City1
0   NewsPaper1  231563
1   NewsPaper2  173009
5   NewsPaper6  137650

Similarly, you can achieve results for all the columns of interest.

Upvotes: 5

Related Questions