Reputation: 146
I have a dataframe containing a time series with hourly data from 2015 to 2020. I want to create a new dataframe that has one column with the values of the time series for each year (or for each month of each year), so I can analyse them separately. Since one of the years is a leap year, I want the columns to share the same index but have a NaN at that position (29 Feb) for the years that are not leap. I tried using merge after creating two new columns called month and day_of_month, but because the merge is many-to-many the index blows up to millions of entries instead of the ~40,000 it should have, and it ends up using more than 20 GB of RAM and crashing:
years = pd.DataFrame(index=pd.date_range('2016-01-01', '2017-01-01', freq='1H'))
years['month'] = years.index.month
years['day_of_month'] = years.index.day
gp = data_md[['value', 'month', 'day_of_month']].groupby(pd.Grouper(freq='1Y'))
for name, group in gp:
    years = years.merge(group, right_on=['month', 'day_of_month'], left_on=['month', 'day_of_month'])
RESULT:
month day_of_month value
0 1 1 0
1 1 1 6
2 1 1 2
3 1 1 0
4 1 1 1
... ... ... ...
210259 12 31 6
210260 12 31 2
210261 12 31 4
210262 12 31 5
210263 12 31 1
How can I build the new frame with one value column per year (or per month)?
Here is the original frame from which I want to create the new one; the only column needed for now is value:
value month day_of_month week day_name year hour season dailyp day_of_week ... hourly_no_noise daily_trend daily_seasonal daily_residuals daily_no_noise daily_trend_h daily_seasonal_h daily_residuals_h daily_no_noise_h Total
date
2015-01-01 00:00:00 0 1 1 1 Thursday 2015 0 Invierno 165.0 3 ... NaN NaN -9.053524 NaN NaN NaN -3.456929 NaN NaN 6436996.0
2015-01-01 01:00:00 6 1 1 1 Thursday 2015 1 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -4.879983 NaN NaN NaN
2015-01-01 02:00:00 2 1 1 1 Thursday 2015 2 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -5.895367 NaN NaN NaN
2015-01-01 03:00:00 0 1 1 1 Thursday 2015 3 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.468616 NaN NaN NaN
2015-01-01 04:00:00 1 1 1 1 Thursday 2015 4 Invierno NaN 3 ... NaN NaN -9.053524 NaN NaN NaN -6.441830 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2019-12-31 19:00:00 6 12 31 1 Tuesday 2019 19 Invierno NaN 1 ... 11.529465 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.805720 1.176491 17.823509 NaN
2019-12-31 20:00:00 3 12 31 1 Tuesday 2019 20 Invierno NaN 1 ... 11.314857 230.571429 -4.997480 -11.299166 237.299166 9.613095 2.928751 1.176491 17.823509 NaN
2019-12-31 21:00:00 3 12 31 1 Tuesday 2019 21 Invierno NaN 1 ... 10.141139 230.571429 -4.997480 -11.299166 237.299166 9.613095 1.774848 1.176491 17.823509 NaN
2019-12-31 22:00:00 3 12 31 1 Tuesday 2019 22 Invierno NaN 1 ... 8.823152 230.571429 -4.997480 -11.299166 237.299166 9.613095 0.663344 1.176491 17.823509 NaN
2019-12-31 23:00:00 6 12 31 1 Tuesday 2019 23 Invierno NaN 1 ... 6.884636 230.571429 -4.997480 -11.299166 237.299166 9.613095 -1.624980 1.176491 17.823509 NaN
I would like to end up with a dataframe like this:
2015 2016 2017 2018 2019
2016-01-01 00:00:00 0.074053 0.218161 0.606810 0.687365 0.352672
2016-01-01 01:00:00 0.465167 0.210297 0.722825 0.683341 0.885175
2016-01-01 02:00:00 0.175964 0.610560 0.722479 0.016842 0.205916
2016-01-01 03:00:00 0.945955 0.807490 0.627525 0.187677 0.535116
2016-01-01 04:00:00 0.757608 0.797835 0.639215 0.455989 0.042285
... ... ... ... ... ...
2016-12-30 20:00:00 0.046138 0.139100 0.397547 0.738687 0.335306
2016-12-30 21:00:00 0.672800 0.802090 0.617625 0.787601 0.007535
2016-12-30 22:00:00 0.698141 0.776686 0.423712 0.667808 0.298338
2016-12-30 23:00:00 0.198089 0.642073 0.586527 0.106567 0.514569
2016-12-31 00:00:00 0.367572 0.390791 0.105193 0.592167 0.007365
where 29 Feb is NaN on non-leap years:
df['2016-02']
2015 2016 2017 2018 2019
2016-02-01 00:00:00 0.656703 0.348784 0.383639 0.208786 0.183642
2016-02-01 01:00:00 0.488729 0.909498 0.873642 0.122028 0.547563
2016-02-01 02:00:00 0.210427 0.912393 0.505873 0.085149 0.358841
2016-02-01 03:00:00 0.281107 0.534750 0.622473 0.643611 0.258437
2016-02-01 04:00:00 0.187434 0.327459 0.701008 0.887041 0.385816
... ... ... ... ... ...
2016-02-29 19:00:00 NaN 0.742402 NaN NaN NaN
2016-02-29 20:00:00 NaN 0.013419 NaN NaN NaN
2016-02-29 21:00:00 NaN 0.517194 NaN NaN NaN
2016-02-29 22:00:00 NaN 0.003136 NaN NaN NaN
2016-02-29 23:00:00 NaN 0.128406 NaN NaN NaN
Upvotes: 0
Views: 638
Reputation: 26211
IIUC, you just need the original DataFrame:
origin = 2016  # or whatever year of your choosing
newidx = pd.to_datetime(df.index.strftime(f'{origin}-%m-%d %H:%M:%S'))
newdf = (
df[['value']]
.assign(year=df.index.year)
.set_axis(newidx, axis=0)
.pivot(columns='year', values='value')
)
Using the small sample data you provided for that "original frame" df, we get:
>>> newdf
year 2015 2019
date
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 6.0 NaN
2016-01-01 02:00:00 2.0 NaN
... ... ...
2016-12-31 21:00:00 NaN 3.0
2016-12-31 22:00:00 NaN 3.0
2016-12-31 23:00:00 NaN 6.0
On a larger (made-up) DataFrame:
np.random.seed(0)
ix = pd.date_range('2015', '2020', freq='H', inclusive='left')
df = pd.DataFrame({'value': np.random.randint(0, 100, len(ix))}, index=ix)
# (code above)
>>> newdf
year 2015 2016 2017 2018 2019
2016-01-01 00:00:00 44.0 82.0 96.0 68.0 71.0
2016-01-01 01:00:00 47.0 99.0 54.0 44.0 71.0
2016-01-01 02:00:00 64.0 28.0 11.0 10.0 55.0
... ... ... ... ... ...
2016-12-31 21:00:00 0.0 30.0 28.0 53.0 14.0
2016-12-31 22:00:00 47.0 82.0 19.0 6.0 64.0
2016-12-31 23:00:00 22.0 75.0 13.0 37.0 35.0
and, as expected, only 2016 has values for 02/29:
>>> newdf[:'2016-02-29 02:00:00'].tail()
year 2015 2016 2017 2018 2019
2016-02-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-02-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-02-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-02-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-02-29 02:00:00 NaN 67.0 NaN NaN NaN
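If you then want the per-month view from your desired output (the df['2016-02'] selection), partial string indexing on the datetime index should still work on newdf. A minimal sketch, assuming the newdf built above from the made-up data:
feb = newdf.loc['2016-02']   # every hour of February, one column per year
print(feb.shape)             # 29 days * 24 hours -> (696, 5) here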
The code above can easily be adapted for month columns:
Either using MultiIndex columns:
origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
df[['value']]
.assign(year=df.index.year, month=df.index.month)
.set_axis(newidx, axis=0)
.pivot(columns=['year', 'month'], values='value')
)
>>> newdf
year 2015 ... 2019
month 1 2 3 4 5 6 7 8 9 10 ... 3 4 5 6 7 8 9 10 11 12
2016-01-01 00:00:00 44.0 49.0 40.0 60.0 71.0 67.0 63.0 16.0 71.0 78.0 ... 32.0 35.0 51.0 35.0 68.0 43.0 4.0 23.0 65.0 19.0
2016-01-01 01:00:00 47.0 71.0 27.0 88.0 68.0 58.0 74.0 67.0 98.0 49.0 ... 85.0 27.0 70.0 8.0 9.0 29.0 78.0 29.0 21.0 68.0
2016-01-01 02:00:00 64.0 90.0 4.0 61.0 95.0 3.0 57.0 41.0 28.0 24.0 ... 7.0 93.0 21.0 10.0 72.0 79.0 46.0 45.0 25.0 99.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-01-31 21:00:00 48.0 NaN 24.0 NaN 79.0 NaN 55.0 47.0 NaN 20.0 ... 87.0 NaN 19.0 NaN 56.0 76.0 NaN 91.0 NaN 14.0
2016-01-31 22:00:00 82.0 NaN 6.0 NaN 46.0 NaN 9.0 57.0 NaN 21.0 ... 69.0 NaN 67.0 NaN 85.0 38.0 NaN 34.0 NaN 64.0
2016-01-31 23:00:00 51.0 NaN 97.0 NaN 45.0 NaN 55.0 41.0 NaN 87.0 ... 94.0 NaN 80.0 NaN 37.0 81.0 NaN 98.0 NaN 35.0
or a simple string column made of %Y-%m to indicate year/month:
origin = 2016
newidx = pd.to_datetime(df.index.strftime(f'{origin}-01-%d %H:%M:%S'))
newdf = (
df[['value']]
.assign(ym=df.index.strftime(f'%Y-%m'))
.set_axis(newidx, axis=0)
.pivot(columns='ym', values='value')
)
>>> newdf
ym 2015-01 2015-02 2015-03 2015-04 2015-05 2015-06 2015-07 2015-08 2015-09 2015-10 ... 2019-03 2019-04 2019-05 2019-06 2019-07 2019-08 2019-09 \
2016-01-01 00:00:00 44.0 49.0 40.0 60.0 71.0 67.0 63.0 16.0 71.0 78.0 ... 32.0 35.0 51.0 35.0 68.0 43.0 4.0
2016-01-01 01:00:00 47.0 71.0 27.0 88.0 68.0 58.0 74.0 67.0 98.0 49.0 ... 85.0 27.0 70.0 8.0 9.0 29.0 78.0
2016-01-01 02:00:00 64.0 90.0 4.0 61.0 95.0 3.0 57.0 41.0 28.0 24.0 ... 7.0 93.0 21.0 10.0 72.0 79.0 46.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-01-31 21:00:00 48.0 NaN 24.0 NaN 79.0 NaN 55.0 47.0 NaN 20.0 ... 87.0 NaN 19.0 NaN 56.0 76.0 NaN
2016-01-31 22:00:00 82.0 NaN 6.0 NaN 46.0 NaN 9.0 57.0 NaN 21.0 ... 69.0 NaN 67.0 NaN 85.0 38.0 NaN
2016-01-31 23:00:00 51.0 NaN 97.0 NaN 45.0 NaN 55.0 41.0 NaN 87.0 ... 94.0 NaN 80.0 NaN 37.0 81.0 NaN
ym 2019-10 2019-11 2019-12
2016-01-01 00:00:00 23.0 65.0 19.0
2016-01-01 01:00:00 29.0 21.0 68.0
2016-01-01 02:00:00 45.0 25.0 99.0
... ... ... ...
2016-01-31 21:00:00 91.0 NaN 14.0
2016-01-31 22:00:00 34.0 NaN 64.0
2016-01-31 23:00:00 98.0 NaN 35.0
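The plain string columns can still be filtered by label if you want, say, every February at once. A minimal sketch, assuming the newdf with ym columns from just above:
feb_cols = newdf.filter(like='-02')   # all 'YYYY-02' columns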
The former gives you more flexibility to index sub-parts. For example, here is a selection of rows for "all February months":
>>> newdf.loc[:'2016-01-29 02:00:00', (slice(None), 2)].tail()
year 2015 2016 2017 2018 2019
month 2 2 2 2 2
2016-01-28 22:00:00 74.0 54.0 22.0 17.0 39.0
2016-01-28 23:00:00 37.0 61.0 31.0 8.0 62.0
2016-01-29 00:00:00 NaN 34.0 NaN NaN NaN
2016-01-29 01:00:00 NaN 82.0 NaN NaN NaN
2016-01-29 02:00:00 NaN 67.0 NaN NaN NaN
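An equivalent way to write that column cross-section is with pd.IndexSlice, which some find more readable; a small sketch against the same MultiIndex-column newdf:
idx = pd.IndexSlice
newdf.loc[:'2016-01-29 02:00:00', idx[:, 2]].tail()   # same selection as above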
Upvotes: 1
Reputation: 396
So let's assume we have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(pd.date_range('2015-01-01', '2020-01-01', freq='1H'),
                  columns=['Date and Time'])
df['str'] = df['Date and Time'].dt.strftime('%Y-%m-%d')
df[['Year', 'Month', 'Day']] = df['str'].apply(lambda x: pd.Series(str(x).split("-")))
df['Values'] = np.random.rand(len(df))
print(df)
Output:
Date and Time str Year Month Day Values
0 2015-01-01 00:00:00 2015-01-01 2015 01 01 0.153948
1 2015-01-01 01:00:00 2015-01-01 2015 01 01 0.663132
2 2015-01-01 02:00:00 2015-01-01 2015 01 01 0.141534
3 2015-01-01 03:00:00 2015-01-01 2015 01 01 0.263551
4 2015-01-01 04:00:00 2015-01-01 2015 01 01 0.094391
... ... ... ... ... .. ...
43820 2019-12-31 20:00:00 2019-12-31 2019 12 31 0.055802
43821 2019-12-31 21:00:00 2019-12-31 2019 12 31 0.952963
43822 2019-12-31 22:00:00 2019-12-31 2019 12 31 0.106768
43823 2019-12-31 23:00:00 2019-12-31 2019 12 31 0.834583
43824 2020-01-01 00:00:00 2020-01-01 2020 01 01 0.325849
[43825 rows x 6 columns]
Now we separate the dataframe by year and store each year in a dict:
d = {}
for i in range(2015, 2020):
    d[i] = pd.DataFrame(df[df['Year'] == str(i)])
    d[i].sort_values(by='Date and Time', inplace=True, ignore_index=True)

for i in range(2015, 2020):
    print('Feb', i, ':', (d[i][d[i]['Month'] == '02']).shape)
    print((d[i][d[i]['Month'] == '02']).tail(3))
    print('-----------------------------------------------------------------')
Output:
Feb 2015 : (672, 6)
Date and Time str Year Month Day Values
1413 2015-02-28 21:00:00 2015-02-28 2015 02 28 0.517525
1414 2015-02-28 22:00:00 2015-02-28 2015 02 28 0.404741
1415 2015-02-28 23:00:00 2015-02-28 2015 02 28 0.299090
-----------------------------------------------------------------
Feb 2016 : (696, 6)
Date and Time str Year Month Day Values
1437 2016-02-29 21:00:00 2016-02-29 2016 02 29 0.854047
1438 2016-02-29 22:00:00 2016-02-29 2016 02 29 0.035787
1439 2016-02-29 23:00:00 2016-02-29 2016 02 29 0.955364
-----------------------------------------------------------------
Feb 2017 : (672, 6)
Date and Time str Year Month Day Values
1413 2017-02-28 21:00:00 2017-02-28 2017 02 28 0.936354
1414 2017-02-28 22:00:00 2017-02-28 2017 02 28 0.954680
1415 2017-02-28 23:00:00 2017-02-28 2017 02 28 0.625131
-----------------------------------------------------------------
Feb 2018 : (672, 6)
Date and Time str Year Month Day Values
1413 2018-02-28 21:00:00 2018-02-28 2018 02 28 0.965274
1414 2018-02-28 22:00:00 2018-02-28 2018 02 28 0.848050
1415 2018-02-28 23:00:00 2018-02-28 2018 02 28 0.238984
-----------------------------------------------------------------
Feb 2019 : (672, 6)
Date and Time str Year Month Day Values
1413 2019-02-28 21:00:00 2019-02-28 2019 02 28 0.476142
1414 2019-02-28 22:00:00 2019-02-28 2019 02 28 0.498278
1415 2019-02-28 23:00:00 2019-02-28 2019 02 28 0.127525
-----------------------------------------------------------------
To fix the leap-year problem:
There is definitely a better way, but the only thing I can think of is to create the missing rows as NaN, add them, and then join the dataframes.
indexs = list(range(1416,1440))
lines = pd.DataFrame(np.nan ,columns = df.columns.values , index = indexs)
print(lines.head())
Output:
Date and Time str Year Month Day Values
1416 NaN NaN NaN NaN NaN NaN
1417 NaN NaN NaN NaN NaN NaN
1418 NaN NaN NaN NaN NaN NaN
1419 NaN NaN NaN NaN NaN NaN
1420 NaN NaN NaN NaN NaN NaN
Then I add the NaN rows to the data frame with the following code:
b = {}
for i in range(2015, 2020):
    if list(d[i][d[i]['Month'] == '02'].tail(1)['Day'])[0] == '28':
        bi = pd.concat([d[i].iloc[0:1416], lines]).reset_index(drop=True)
        b[i] = pd.concat([bi, d[i].iloc[1416:8783]]).reset_index(drop=True)
    else:
        b[i] = d[i].copy()

for i in range(2015, 2020):
    print(i, ':', b[i].shape)
    print(b[i].iloc[1438:1441])
    print('-----------------------------------------------------------------')
Output:
2015 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2015-03-01 2015-03-01 2015 03 01 0.676486
-----------------------------------------------------------------
2016 : (8784, 6)
Date and Time str Year Month Day Values
1438 2016-02-29 22:00:00 2016-02-29 2016 02 29 0.035787
1439 2016-02-29 23:00:00 2016-02-29 2016 02 29 0.955364
1440 2016-03-01 00:00:00 2016-03-01 2016 03 01 0.014158
-----------------------------------------------------------------
2017 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2017-03-01 2017-03-01 2017 03 01 0.035952
-----------------------------------------------------------------
2018 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2018-03-01 2018-03-01 2018 03 01 0.44876
-----------------------------------------------------------------
2019 : (8784, 6)
Date and Time str Year Month Day Values
1438 NaT NaN NaN NaN NaN NaN
1439 NaT NaN NaN NaN NaN NaN
1440 2019-03-01 2019-03-01 2019 03 01 0.096433
-----------------------------------------------------------------
And finally, to create the dataframe you want:
final_df = pd.DataFrame(index=b[2016]['Date and Time'])
for i in range(2015, 2020):
    final_df[i] = np.array(b[i]['Values'])
Output:
2015 2016 2017 2018 2019
Date and Time
2016-01-01 00:00:00 0.153948 0.145602 0.957265 0.427620 0.868948
2016-01-01 01:00:00 0.663132 0.318746 0.013658 0.380105 0.442332
2016-01-01 02:00:00 0.141534 0.483471 0.048050 0.139065 0.702211
2016-01-01 03:00:00 0.263551 0.737948 0.528827 0.472889 0.165095
2016-01-01 04:00:00 0.094391 0.939737 0.120343 0.134011 0.297611
... ... ... ... ... ...
2016-02-28 22:00:00 0.404741 0.864423 0.954680 0.848050 0.498278
2016-02-28 23:00:00 0.299090 0.348466 0.625131 0.238984 0.127525
2016-02-29 00:00:00 NaN 0.375469 NaN NaN NaN
2016-02-29 01:00:00 NaN 0.186092 NaN NaN NaN
... ... ... ... ... ...
2016-02-29 22:00:00 NaN 0.035787 NaN NaN NaN
2016-02-29 23:00:00 NaN 0.955364 NaN NaN NaN
2016-03-01 00:00:00 0.676486 0.014158 0.035952 0.448760 0.096433
2016-03-01 01:00:00 0.792168 0.520436 0.138874 0.229396 0.913848
... ... ... ... ... ...
2016-12-31 19:00:00 0.517459 0.956219 0.116335 0.736170 0.739740
2016-12-31 20:00:00 0.814362 0.324332 0.324911 0.485508 0.055802
2016-12-31 21:00:00 0.870459 0.809150 0.335461 0.124459 0.952963
2016-12-31 22:00:00 0.549891 0.043623 0.997053 0.144286 0.106768
2016-12-31 23:00:00 0.047090 0.730074 0.698159 0.235253 0.834583
[8784 rows x 5 columns]
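As mentioned above, there is probably a cleaner way to handle the leap year than splicing NaN rows in by position. One possible sketch (not the method used above, and assuming the same df built at the start of this answer) is to re-stamp every timestamp onto the 2016 calendar and reindex against a full leap-year hourly range, so 29 Feb becomes NaN automatically for the non-leap years:
full_2016 = pd.date_range('2016-01-01 00:00', '2016-12-31 23:00', freq='1H')
final_df = pd.DataFrame(index=full_2016)
for year in range(2015, 2020):
    one_year = df[df['Year'] == str(year)].set_index('Date and Time')
    # map this year's timestamps onto 2016 so all years share the same axis
    one_year.index = pd.to_datetime(one_year.index.strftime('2016-%m-%d %H:%M:%S'))
    final_df[year] = one_year['Values'].reindex(full_2016)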
Upvotes: 1