Reputation: 181
i have the following dataframe structure:
exec_start_date exec_finish_date hour_start hour_finish session_qtd
2020-03-01 2020-03-02 22 0 1
2020-03-05 2020-03-05 22 23 3
2020-03-03 2020-03-04 18 7 4
2020-03-07 2020-03-07 18 18 2
As you can see above, i have three situations of sessions execution:
1) Start in one day and finish in another day with different hours
2) Start in one day and finish in the same day with different hours
3) Start in one day and finish in the same day with same hours
I need to create a column filling the interval between hour_start and hour_finish and create another column with execution date. Then:
So, a intermediary dataset would be like this:
exec_start_date exec_finish_date hour_start hour_finish session_qtd hour_interval
2020-03-01 2020-03-02 22 0 1 [22,23,0]
2020-03-05 2020-03-05 22 23 3 [22,23]
2020-03-03 2020-03-04 20 3 4 [20,21,22,23,0,1,2,3]
2020-03-07 2020-03-07 18 18 2 [18]
And the final dataset would be like this:
exec_date session_qtd hour_interval
2020-03-01 1 22
2020-03-01 1 23
2020-03-02 1 0
2020-03-05 3 22
2020-03-05 3 23
2020-03-03 4 20
2020-03-03 4 21
2020-03-03 4 22
2020-03-03 4 23
2020-03-04 4 0
2020-03-04 4 1
2020-03-04 4 2
2020-03-04 4 3
2020-03-07 2 18
I have tried to create the interval with np.arange but didn't work properly to all cases, specially with the cases that start in one day and finish in another day.
Can you help me?
Upvotes: 2
Views: 1281
Reputation: 11342
Another approach without converting the date fields:
import pandas as pd
import numpy as np
ss = '''
exec_start_date,exec_finish_date,hour_start,hour_finish,session_qtd
2020-03-01,2020-03-02,22,0,1
2020-03-05,2020-03-05,22,23,3
2020-03-03,2020-03-04,18,7,4
2020-03-07,2020-03-07,18,18,2
'''.strip()
with open('data.csv','w') as f: f.write(ss)
########## main script #############
df = pd.read_csv('data.csv')
df['hour_start'] = df['hour_start'].astype('int')
df['hour_finish'] = df['hour_finish'].astype('int')
# get hour list, group by day
df['hour_interval'] = df.apply(lambda x: list(range(x[2],x[3]+1)) if x[0]==x[1] else [(101,) + tuple(range(x[2],24))]+[(102,)+tuple(range(0,x[3]+1))], axis=1)
lst_col = 'hour_interval'
# split day group to separate rows
hrdf = pd.DataFrame({
col:np.repeat(df[col].values, df[lst_col].str.len())
for col in df.columns.drop(lst_col)}
).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
# choose start\end day
hrdf['exec_date'] = hrdf.apply(lambda x: x[0] if type(x[5]) is int or x[5][0] == 101 else x[1], axis=1)
# remove day indicator
hrdf['hour_interval'] = hrdf['hour_interval'].apply(lambda x: [x] if type(x) is int else list(x[1:]))
# split hours to separate rows
df = hrdf
hrdf = pd.DataFrame({
col:np.repeat(df[col].values, df[lst_col].str.len())
for col in df.columns.drop(lst_col)}
).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]
# columns for final output
df_final = hrdf[['exec_date','session_qtd','hour_interval']]
print(df_final.to_string(index=False))
Output
exec_date session_qtd hour_interval
2020-03-01 1 22
2020-03-01 1 23
2020-03-02 1 0
2020-03-05 3 22
2020-03-05 3 23
2020-03-03 4 18
2020-03-03 4 19
2020-03-03 4 20
2020-03-03 4 21
2020-03-03 4 22
2020-03-03 4 23
2020-03-04 4 0
2020-03-04 4 1
2020-03-04 4 2
2020-03-04 4 3
2020-03-04 4 4
2020-03-04 4 5
2020-03-04 4 6
2020-03-04 4 7
2020-03-07 2 18
Upvotes: 0
Reputation: 26676
Make df dtype string
df=df.astype(str)
Concat date with hour and coerce to datetime
df['exec_start_date']=pd.to_datetime(df['exec_start_date'].str.cat(df['hour_start'], sep=' ')+ ':00:00')
df['exec_finish_date']=pd.to_datetime(df['exec_finish_date'].str.cat(df['hour_finish'], sep=' ')+ ':00:00')
Derive hourly periods between the start and end datetime
df['date'] = df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'],freq='H'), end=pd.Period(x['exec_finish_date'],freq='H'), freq='H').hour.tolist(), axis = 1)
Explode date to achieve outcome
df.explode('date')
Can skip last step by exploding after deriving hourly periods as follows
df.assign(date=df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'],freq='H'), end=pd.Period(x['exec_finish_date'],freq='H'), freq='H').hour.tolist(), axis = 1)).explode('date')
Outcome
exec_start_date exec_finish_date hour_start hour_finish session_qtd \
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
0 2020-03-01 22:00:00 2020-03-02 00:00:00 22 0 1
1 2020-03-05 22:00:00 2020-03-05 23:00:00 22 23 3
1 2020-03-05 22:00:00 2020-03-05 23:00:00 22 23 3
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
2 2020-03-03 18:00:00 2020-03-04 07:00:00 18 7 4
3 2020-03-07 18:00:00 2020-03-07 18:00:00 18 18 2
date
0 22
0 23
0 0
1 22
1 23
2 18
2 19
2 20
2 21
2 22
2 23
2 0
2 1
2 2
2 3
2 4
2 5
2 6
2 7
3 18
Upvotes: 1
Reputation: 2579
So the way I would do this is to get the full dates to get the time interval, then just pull the hours from that range. np.arange
will not work because hours loop.
#Create two new temp columns that calculate have the full start and end date
df['start'] = df['exec_start_date'].astype(str) + " " + df['hour_start'].astype(str) + ":00.000"
df['end'] = df['exec_finish_date'].astype(str) + " " + df['hour_finish'].astype(str) + ":00.000"
#Convert them both to datetimes
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
#Apply a function to get your range
df['date'] = df.apply(lambda x: pd.date_range(x['start'], x['end'], freq='H').tolist(), axis = 1)
#explode new date column to get 1 column for all the created interval dates
df = df.explode(column = 'date')
#Make two new columns based on your final requested table based on new column
df['exec_date'] = df['date'].dt.date
df['hour_interval'] = df['date'].dt.hour
#Make a copy of the columns you wanted in a new df
df2 = df[['exec_date','session_qtd','hour_interval']].copy()
Upvotes: 2