Reputation: 77
I have a time series dataset which looks like this:
Date Newspaper City1 City2 Region1Total City3 City4 Region2Total
2017-12-01 NewsPaper1 231563 8696 240259 21072 8998 30070
2017-12-01 NewsPaper2 173009 12180 185189 28910 5550 34460
2017-12-01 NewsPaper3 40511 4600 45111 5040 3330 8370
2017-12-01 NewsPaper4 37770 2980 40750 6520 1880 8400
2017-12-01 NewsPaper5 5176 900 6076 1790 5000 6790
2017-12-01 NewsPaper6 137650 8025 145675 25300 11000 36300
2017-12-01 Total 637547 38201 675748 91032 36558 127590
2018-01-01 NewsPaper1 231295 8391 239686 8790 21176 29966
2018-01-01 NewsPaper2 169937 12130 182067 7890 28850 36740
2018-01-01 NewsPaper3 40453 4570 45023 4750 5055 9800
2018-01-01 NewsPaper4 37766 2970 40736 2500 6540 9040
2018-01-01 NewsPaper5 5136 900 6036 5600 1795 7365
2018-01-01 NewsPaper6 137990 8010 146000 14500 25330 39830
2018-01-01 Total 633919 37786 671705 44980 91141 136121
I am trying to find the max n values in each column of this dataframe. I tried the following method
somelist = []
data = pd.read_excel('newspaper.csv')
data.index = pd.to_datetime(data['Date'], errors='coerce')
last_month = data.loc[data.index[-1]] # i am considering only the previous month(latest month in the dataframe)
last_month.set_index('Newspaper', inplace = True)
for city in last_month.iloc[:, 2: ]:
top_3 = last_month[city].nlargest(4)[1: ] #highest will be total but we should skip it
somelist.append(top_3)
print(somelist)
This produces the result as pandas series with the name of the column mentioned below as:
[Newspaper
Newspaper1 231295
Newspaper2 169937
Newspaper6 137990
Name: City1, dtype: float64, Newspaper
Newspaper2 12130.0
Newspaper1 8391.0
Newspaper6 8010.0
Name: City2, dtype: float64, Newspaper
Newspaper1 240259
Newspaper2 185189
Newspaper6 145675
Name: Region1Total, dtype: float64, Newspaper
Newspaper6 14500.0
Newspaper1 8790.0
Newspaper2 7890.0
Name: City3, dtype: float64, Newspaper
Newspaper2 28850.0
Newspaper6 25330.0
Newspaper1 21176.0
Name: City4, dtype: float64, Newspaper
Newspaper6 36300
Newspaper2 34460
Newspaper1 34460
Name: Region2Total, dtype: float64, Newspaper]
What I want is the the top 3 selling newspapers in each city and region along with the sales numbers arranged in the descending order. I also want the name of the city/region to be printed before showing the top 3 result.
Expected output is a list or a series like below:
Newspaper City1
Newspaper1 231295
Newspaper2 169937
Newspaper6 137990
Newspaper City2
Newspaper2 12130.0
Newspaper1 8391.0
Newspaper6 8010.0
Newspaper Region1Total
Newspaper1 240259
Newspaper2 185189
Newspaper6 145675
Newspaper City3
Newspaper6 14500.0
Newspaper1 8790.0
Newspaper2 7890.0
Newspaper City4
Newspaper2 28850.0
Newspaper6 25330.0
Newspaper1 21176.0
Newspaper Region2Total
Newspaper6 36300
Newspaper2 34460
Newspaper1 34460
Also, if i want to skip the regions and just consider the cities then how can i do it for the same? Any help would be appreciated. Thank you in advance.
Upvotes: 3
Views: 3633
Reputation: 999
import pandas as pd
# Setup the data
data = pd.DataFrame({'Date': {0: '2017-12-01',
1: '2017-12-01',
2: '2017-12-01',
3: '2017-12-01',
4: '2017-12-01',
5: '2017-12-01'},
'Newspaper': {0: 'NewsPaper1',
1: 'NewsPaper2',
2: 'NewsPaper3',
3: 'NewsPaper4',
4: 'NewsPaper5',
5: 'NewsPaper6'},
'City1': {0: 231563, 1: 173009, 2: 40511, 3: 37770, 4: 5176, 5: 137650},
'City2': {0: 8696, 1: 12180, 2: 4600, 3: 2980, 4: 900, 5: 8025},
'Region1Total': {0: 240259,
1: 185189,
2: 45111,
3: 40750,
4: 6076,
5: 145675},
'City3': {0: 21072, 1: 28910, 2: 5040, 3: 6520, 4: 1790, 5: 25300},
'City4': {0: 8998, 1: 5550, 2: 3330, 3: 1880, 4: 5000, 5: 11000},
'Region2Total': {0: 30070, 1: 34460, 2: 8370, 3: 8400, 4: 6790, 5: 36300}}
)
# Not all columns are required, only the Newspaper and any 'City' column
cleaned_data = data[[i for i in data.columns if 'City' in i] + ['Newspaper']]
# Change the structure
df = cleaned_data.set_index('Newspaper').unstack()
# Get the top 3 values for each city
df = df.groupby(level=0).apply(lambda df: df.sort_values(ascending=False)[:4])
df.index = df.index.droplevel(0)
df
Out[]:
Newspaper
City1 NewsPaper1 231563
NewsPaper2 173009
NewsPaper6 137650
NewsPaper3 40511
City2 NewsPaper2 12180
NewsPaper1 8696
NewsPaper6 8025
NewsPaper3 4600
City3 NewsPaper2 28910
NewsPaper6 25300
NewsPaper1 21072
NewsPaper4 6520
City4 NewsPaper6 11000
NewsPaper1 8998
NewsPaper2 5550
NewsPaper5 5000
Upvotes: 0
Reputation: 8641
First of all, you need to get a dataframe where only Newspapers are listed, not total.
dff = df.loc[df['Newspaper']!='Total']
Then for city1
, you can do:
dff[['Newspaper', 'City1']].sort_values(['City1'], ascending=False).head(3)
Output:
Newspaper City1
0 NewsPaper1 231563
1 NewsPaper2 173009
5 NewsPaper6 137650
Similarly, you can achieve results for all the columns of interest.
Upvotes: 5