ah bon
ah bon

Reputation: 10011

Extract weekly data from daily and reshape it from long to wide format using Pandas

Given a sample data as follows, I hope to extract one data entry for each week, if for the week having multiple entries, then I will use the largest weekday's data as for that week:

          date  variable      value
0    2020-11-4  quantity   564.0000
1   2020-11-11  quantity   565.0000
2   2020-11-18  quantity   566.0000
3   2020-11-25  quantity   566.0000
4    2020-11-2     price  1829.1039
5    2020-11-3     price  1789.5883
6    2020-11-4     price  1755.4307
7    2020-11-5     price  1750.0727
8    2020-11-6     price  1746.7239
9    2020-11-9     price  1756.1005
10  2020-11-10     price  1752.0820
11  2020-11-11     price  1814.3693
12  2020-11-12     price  1833.7922
13  2020-11-13     price  1833.7922
14  2020-11-16     price  1784.2302
15  2020-11-17     price  1764.1376
16  2020-11-18     price  1770.1654
17  2020-11-19     price  1757.4400
18  2020-11-20     price  1770.1654

To get week number of each date, I use df['week_number'] = pd.to_datetime(df['date']).dt.week.

          date  variable      value  week_number
0    2020-11-4  quantity   564.0000           45  --> to keep
1   2020-11-11  quantity   565.0000           46  --> to keep
2   2020-11-18  quantity   566.0000           47  --> to keep
3   2020-11-25  quantity   566.0000           48  --> to keep
4    2020-11-2     price  1829.1039           45
5    2020-11-3     price  1789.5883           45
6    2020-11-4     price  1755.4307           45
7    2020-11-5     price  1750.0727           45
8    2020-11-6     price  1746.7239           45  --> to keep, since it's the largest weekday for this week
9    2020-11-9     price  1756.1005           46
10  2020-11-10     price  1752.0820           46
11  2020-11-11     price  1814.3693           46
12  2020-11-12     price  1833.7922           46
13  2020-11-13     price  1833.7922           46  --> to keep, since it's the largest weekday for this week
14  2020-11-16     price  1784.2302           47
15  2020-11-17     price  1764.1376           47
16  2020-11-18     price  1770.1654           47
17  2020-11-19     price  1757.4400           47
18  2020-11-20     price  1770.1654           47  --> to keep, since it's the largest weekday for this week

Finally, I will reshape rows indicating to_keep to the expected result as follow:

   variable  the_45th_week  the_46th_week  the_47th_week  the_48th_week
0  quantity       564.0000       565.0000       566.0000          566.0
1     price      1756.1005      1833.7922      1770.1654            NaN

How could I manipulate data to get the expected result? Sincere thanks.

EDIT:

df = df.sort_values(by=['variable','date'], ascending=False)
df.drop_duplicates(['variable', 'week_number'], keep='last')

Out:

          date  variable      value  week_number
0    2020-11-4  quantity   564.0000           45
3   2020-11-25  quantity   566.0000           48
2   2020-11-18  quantity   566.0000           47
1   2020-11-11  quantity   565.0000           46
4    2020-11-2     price  1829.1039           45
14  2020-11-16     price  1784.2302           47
10  2020-11-10     price  1752.0820           46

Upvotes: 1

Views: 245

Answers (1)

jezrael
jezrael

Reputation: 862521

In your solution is possible add pivot with rename:

df['week_number'] = pd.to_datetime(df['date']).dt.week
df = df.sort_values(by=['variable','date'], ascending=False)
df = df.drop_duplicates(['variable', 'week_number'], keep='last')

f = lambda x: f'the_{x}th_week'
out = df.pivot('variable','week_number','value').rename(columns=f)
print(out)
week_number  the_45th_week  the_46th_week  the_47th_week  the_48th_week
variable                                                               
price            1829.1039       1752.082      1784.2302            NaN
quantity          564.0000        565.000       566.0000          566.0

Or remove DataFrame.drop_duplicates, so is possible use DataFrame.pivot_table with aggregate function last:

df['week_number'] = pd.to_datetime(df['date']).dt.week
df = df.sort_values(by=['variable','date'], ascending=False)

f = lambda x: f'the_{x}th_week'
out = df.pivot_table(index='variable',columns='week_number',values='value', aggfunc='last').rename(columns=f)

EDIT: to get an exact same result as the expected one:

out.reset_index().rename_axis(None, axis=1)

Out:

   variable  the_45th_week  the_46th_week  the_47th_week  the_48th_week
0     price      1829.1039       1752.082      1784.2302            NaN
1  quantity       564.0000        565.000       566.0000          566.0

Upvotes: 1

Related Questions