Reputation: 113
I'm working with a data set containing information on a phenomenon occurring during some time frames. I am given the start and end time of the event and its severity, as well as some other information. I would like to expand these frames over some larger time period by expanding the rows within set time periods and leaving the rest of the information as NaNs.
Data set example:
date_end severity category
date_start
2018-01-04 07:00:00 2018-01-04 10:00:00 12 1
2018-01-04 12:00:00 2018-01-04 13:00:00 44 2
What I want is:
severity category
date_start
2018-01-04 07:00:00 12 1
2018-01-04 08:00:00 12 1
2018-01-04 09:00:00 12 1
2018-01-04 10:00:00 12 1
2018-01-04 11:00:00 nan nan
2018-01-04 12:00:00 44 2
2018-01-04 13:00:00 44 2
2018-01-04 14:00:00 nan nan
2018-01-04 15:00:00 nan nan
What would be an efficient way of achieving such a result?
Upvotes: 11
Views: 9632
Reputation: 153460
One way to do this is to reindex the datafame using pd.date_range, the use ffill and mask the values where the index is greater than date_end.
df.index = pd.to_datetime(df.index)
df['date_end'] = pd.to_datetime(df['date_end'])
df1 = df.reindex(pd.date_range(df.index.min(), '2018-01-04 15:00:00', freq='H'))
df1 = df1.ffill()
df1.loc[(df1.index - df1['date_end']) > pd.Timedelta(days=0)] = np.nan
df_out = df1.drop('date_end', axis=1)
print(df_out)
Output:
severity category
2018-01-04 07:00:00 12.0 1.0
2018-01-04 08:00:00 12.0 1.0
2018-01-04 09:00:00 12.0 1.0
2018-01-04 10:00:00 12.0 1.0
2018-01-04 11:00:00 NaN NaN
2018-01-04 12:00:00 44.0 2.0
2018-01-04 13:00:00 44.0 2.0
2018-01-04 14:00:00 NaN NaN
2018-01-04 15:00:00 NaN NaN
Upvotes: 3
Reputation: 93141
Assuming you are on pandas v0.25, use explode
:
df['hour'] = df.apply(lambda row: pd.date_range(row.name, row['date_end'], freq='H'), axis=1)
df = df.explode('hour').reset_index() \
.drop(columns=['date_start', 'date_end']) \
.rename(columns={'hour': 'date_start'}) \
.set_index('date_start')
For the rows with nan
, you may reindex your dataframe.
# Report from Jan 4 - 5, 2018, from 7AM - 7PM
days = pd.date_range('2018-01-04', '2018-01-05')
hours = pd.to_timedelta(range(7, 20), unit='h')
tmp = pd.MultiIndex.from_product([days, hours], names=['Date', 'Hour']).to_frame()
s = tmp['Date'] + tmp['Hour']
df.reindex(s)
Upvotes: 11