i have the following dataframe structure: exec_start_date exec_finish_date hour_start hour_finish session_qtd 2020-03-01 2020-03-02 22 0 1 2020-03-05 2020-03-05 22 23 3 2020-03-03 2020-03-04 18 7 4 2020-03-07 2020-03-07 18 18 2 As you can see above, i have three situations of sessions execution: 1) Start in one day and finish in another day with different hours 2) Start in one day and finish in the same day with different hours 3) Start in one day and finish in the same day with same hours I need to create a column filling the interval between hour_start and hour_finish and create another column with execution date. Then: If a session starts in one day and finish in another day, the hours executed in the start date need to be filled with exec_start_date and the hours executed in the following day need to be filled with the exec_finish_date . So, a intermediary dataset would be like this: exec_start_date exec_finish_date hour_start hour_finish session_qtd hour_interval 2020-03-01 2020-03-02 22 0 1 [22,23,0] 2020-03-05 2020-03-05 22 23 3 [22,23] 2020-03-03 2020-03-04 20 3 4 [20,21,22,23,0,1,2,3] 2020-03-07 2020-03-07 18 18 2 [18] And the final dataset would be like this: exec_date session_qtd hour_interval 2020-03-01 1 22 2020-03-01 1 23 2020-03-02 1 0 2020-03-05 3 22 2020-03-05 3 23 2020-03-03 4 20 2020-03-03 4 21 2020-03-03 4 22 2020-03-03 4 23 2020-03-04 4 0 2020-03-04 4 1 2020-03-04 4 2 2020-03-04 4 3 2020-03-07 2 18 I have tried to create the interval with np.arange but didn't work properly to all cases, specially with the cases that start in one day and finish in another day. Can you help me?

Reputation: 181

How to create a column with hour interval from two columns in Python?

i have the following dataframe structure:

exec_start_date  exec_finish_date  hour_start  hour_finish session_qtd 
2020-03-01       2020-03-02          22           0            1
2020-03-05       2020-03-05          22           23           3 
2020-03-03       2020-03-04          18           7            4
2020-03-07       2020-03-07          18           18           2

As you can see above, i have three situations of sessions execution:

1) Start in one day and finish in another day with different hours

2) Start in one day and finish in the same day with different hours

3) Start in one day and finish in the same day with same hours

I need to create a column filling the interval between hour_start and hour_finish and create another column with execution date. Then:

If a session starts in one day and finish in another day, the hours executed in the start date need to be filled with exec_start_date and the hours executed in the following day need to be filled with the exec_finish_date.

So, a intermediary dataset would be like this:

exec_start_date  exec_finish_date  hour_start  hour_finish session_qtd   hour_interval
    2020-03-01       2020-03-02          22           0            1       [22,23,0]
    2020-03-05       2020-03-05          22           23           3       [22,23]
    2020-03-03       2020-03-04          20           3            4       [20,21,22,23,0,1,2,3]
    2020-03-07       2020-03-07          18           18           2       [18]

And the final dataset would be like this:

exec_date  session_qtd  hour_interval
2020-03-01     1             22
2020-03-01     1             23
2020-03-02     1             0
2020-03-05     3             22
2020-03-05     3             23
2020-03-03     4             20
2020-03-03     4             21
2020-03-03     4             22
2020-03-03     4             23
2020-03-04     4             0
2020-03-04     4             1
2020-03-04     4             2
2020-03-04     4             3
2020-03-07     2             18

I have tried to create the interval with np.arange but didn't work properly to all cases, specially with the cases that start in one day and finish in another day.

Can you help me?

Upvotes: 2

Answers (3)

Mike67

Reputation: 11342

Another approach without converting the date fields:

import pandas as pd
import numpy as np

ss = '''
exec_start_date,exec_finish_date,hour_start,hour_finish,session_qtd
2020-03-01,2020-03-02,22,0,1
2020-03-05,2020-03-05,22,23,3
2020-03-03,2020-03-04,18,7,4
2020-03-07,2020-03-07,18,18,2
'''.strip()

with open('data.csv','w') as f: f.write(ss)

########## main script #############

df = pd.read_csv('data.csv')
df['hour_start'] = df['hour_start'].astype('int')
df['hour_finish'] = df['hour_finish'].astype('int')

# get hour list, group by day
df['hour_interval'] = df.apply(lambda x: list(range(x[2],x[3]+1)) if x[0]==x[1] else [(101,) + tuple(range(x[2],24))]+[(102,)+tuple(range(0,x[3]+1))], axis=1)

lst_col = 'hour_interval'

# split day group to separate rows
hrdf = pd.DataFrame({
      col:np.repeat(df[col].values, df[lst_col].str.len())
      for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]

# choose start\end day
hrdf['exec_date'] = hrdf.apply(lambda x: x[0] if type(x[5]) is int or x[5][0] == 101 else x[1], axis=1)

# remove day indicator
hrdf['hour_interval'] = hrdf['hour_interval'].apply(lambda x: [x] if type(x) is int else list(x[1:]))

# split hours to separate rows
df = hrdf
hrdf = pd.DataFrame({
      col:np.repeat(df[col].values, df[lst_col].str.len())
      for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]

# columns for final output
df_final = hrdf[['exec_date','session_qtd','hour_interval']]

print(df_final.to_string(index=False))

Output

  exec_date  session_qtd  hour_interval
 2020-03-01            1             22
 2020-03-01            1             23
 2020-03-02            1              0
 2020-03-05            3             22
 2020-03-05            3             23
 2020-03-03            4             18
 2020-03-03            4             19
 2020-03-03            4             20
 2020-03-03            4             21
 2020-03-03            4             22
 2020-03-03            4             23
 2020-03-04            4              0
 2020-03-04            4              1
 2020-03-04            4              2
 2020-03-04            4              3
 2020-03-04            4              4
 2020-03-04            4              5
 2020-03-04            4              6
 2020-03-04            4              7
 2020-03-07            2             18

Upvotes: 0

wwnde

Reputation: 26676

Make df dtype string

df=df.astype(str)

Concat date with hour and coerce to datetime

df['exec_start_date']=pd.to_datetime(df['exec_start_date'].str.cat(df['hour_start'], sep=' ')+ ':00:00')
df['exec_finish_date']=pd.to_datetime(df['exec_finish_date'].str.cat(df['hour_finish'], sep=' ')+ ':00:00')

Derive hourly periods between the start and end datetime

df['date'] = df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'],freq='H'),  end=pd.Period(x['exec_finish_date'],freq='H'), freq='H').hour.tolist(), axis = 1)

Explode date to achieve outcome

df.explode('date')

Can skip last step by exploding after deriving hourly periods as follows

df.assign(date=df.apply(lambda x: pd.period_range(start=pd.Period(x['exec_start_date'],freq='H'),  end=pd.Period(x['exec_finish_date'],freq='H'), freq='H').hour.tolist(), axis = 1)).explode('date')

Outcome

 exec_start_date    exec_finish_date hour_start hour_finish session_qtd  \
0 2020-03-01 22:00:00 2020-03-02 00:00:00         22           0           1   
0 2020-03-01 22:00:00 2020-03-02 00:00:00         22           0           1   
0 2020-03-01 22:00:00 2020-03-02 00:00:00         22           0           1   
1 2020-03-05 22:00:00 2020-03-05 23:00:00         22          23           3   
1 2020-03-05 22:00:00 2020-03-05 23:00:00         22          23           3   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
2 2020-03-03 18:00:00 2020-03-04 07:00:00         18           7           4   
3 2020-03-07 18:00:00 2020-03-07 18:00:00         18          18           2   

  date  
0   22  
0   23  
0    0  
1   22  
1   23  
2   18  
2   19  
2   20  
2   21  
2   22  
2   23  
2    0  
2    1  
2    2  
2    3  
2    4  
2    5  
2    6  
2    7  
3   18

Upvotes: 1

Ben Pap

Reputation: 2579

So the way I would do this is to get the full dates to get the time interval, then just pull the hours from that range. np.arange will not work because hours loop.

#Create two new temp columns that calculate have the full start and end date
df['start'] = df['exec_start_date'].astype(str) + " " + df['hour_start'].astype(str) + ":00.000"
df['end'] = df['exec_finish_date'].astype(str) + " " + df['hour_finish'].astype(str) + ":00.000"

#Convert them both to datetimes
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

#Apply a function to get your range
df['date'] = df.apply(lambda x: pd.date_range(x['start'], x['end'], freq='H').tolist(), axis = 1)

#explode new date column to get 1 column for all the created interval dates
df = df.explode(column = 'date')

#Make two new columns based on your final requested table based on new column
df['exec_date'] = df['date'].dt.date
df['hour_interval'] = df['date'].dt.hour

#Make a copy of the columns you wanted in a new df
df2 = df[['exec_date','session_qtd','hour_interval']].copy()

Upvotes: 2

How to create a column with hour interval from two columns in Python?

Answers (3)

Related Questions