Pandas dataframe create new rows based on condition from another column

Question

I have dataframe:

data = {'startTime':['01-06-2010 09:00:00', '13-02-2016 09:00:00', '18-03-2018 09:00:00', '23-05-2011 09:00:00'], 'endTime':['02-06-2010 17:00:00', '14-02-2016 17:00:00', '19-03-2018 17:00:00', '24-05-2011 17:00:00'], 'durationInMinutes': [1440, 1440, 1440, 1440]}

df = pd.DataFrame(data)

I want to make additional rows by dividing 1440 minutes into equal intervals of 8 hours per day. So, 1440 minutes would be equal to 3 days (3 additional rows) (9am - 5pm). Minutes can be more than 1440. New rows in startTime and endTime would be:

startTime               endTime

01-06-2010 09:00:00     01-06-2010 17:00:00
02-06-2010 09:00:00     02-06-2010 17:00:00
03-06-2010 09:00:00     03-06-2010 17:00:00
13-02-2016 09:00:00     13-02-2016 17:00:00
14-02-2016 09:00:00     14-02-2016 17:00:00
15-02-2016 09:00:00     15-02-2016 17:00:00

Can anyone please help me with this? Thanks.

constantstranger · Accepted Answer

UPDATE #2:

Here is a way to do what the question asks, as clarified by multiple OP comments, which is to:

for each existing row, create enough rows in the result to accommodate consecutive days with periods of 8 hours (or, for the final day, up to 8 hours) such that the aggregate duration of this group of result rows matches the durationInMinutes column value in the input.
set durationInMinutes for each result row to match the above logic.
ensure that the startTime of each result row is 09:00 on the appropriate date.
ensure that the endTime of each result row is a period of durationInMinutes later than startTime to match the above logic.

df['days'] = df.durationInMinutes // (8 * 60) + (df.durationInMinutes % (8 * 60) > 0)
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1) + 
    [x.durationInMinutes % (8 * 60) + (x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = pd.to_datetime(df.startTime, dayfirst=True)
df.startTime = pd.to_datetime(pd.DataFrame({
    'year':df.startTime.dt.year, 'month':df.startTime.dt.month, 
    'day':df.startTime.dt.day + df.daysToAdd, 
    'hour':[9]*len(df.index)}))
df.endTime = (df.startTime.astype('int64') + 
        df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00                475
1  13-02-2016 08:30:00  14-02-2016 17:00:00                510
2  18-03-2018 09:30:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

            startTime             endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-01 16:55:00               475
1 2016-02-13 09:00:00 2016-02-13 17:00:00               480
2 2016-02-14 09:00:00 2016-02-14 09:30:00                30
3 2018-03-18 09:00:00 2018-03-18 17:00:00               480
4 2018-03-19 09:00:00 2018-03-19 17:00:00               480
5 2018-03-20 09:00:00 2018-03-20 17:00:00               480
6 2011-05-23 09:00:00 2011-05-23 17:00:00               480
7 2011-05-24 09:00:00 2011-05-24 17:00:00               480
8 2011-05-25 09:00:00 2011-05-25 17:00:00               480

UPDATED: Here is a way to do what you have asked in your question and clarified in your comments:

df['days'] = df.durationInMinutes // (8 * 60) + 
    (df.durationInMinutes % (8 * 60) > 0).astype(int)
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1) + 
    [x.durationInMinutes % (8 * 60) + 
    (x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = (pd.to_datetime(df.startTime, dayfirst=True).astype('int64') + 
        df.daysToAdd * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df.endTime = (df.startTime.astype('int64') + 
        df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)

Explanation:

add column days with number of result rows corresponding to each input row.
update durationInMinutes column to contain a list for each row of the minutes in each result row originating from the input row.
add column daysToAdd with a list for each row of the number of days to add to the startTime for each result row originating from the input row.
use explode() to create result rows with one value each from the lists in durationInMinutes and daysToAdd.
add the nanosecond equivalent of daysToAdd to startTime.
update endTime to be startTime plus the nanosecond equivalent of durationInMinutes.
use drop() to eliminate unneeded columns, and use reset_index() to get an integer range index starting at 0 that advances by 1 for each row.

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00               1445
1  13-02-2016 09:00:00  14-02-2016 17:00:00               1435
2  18-03-2018 09:00:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

             startTime             endTime durationInMinutes
0  2010-06-01 09:00:00 2010-06-01 17:00:00               480
1  2010-06-02 09:00:00 2010-06-02 17:00:00               480
2  2010-06-03 09:00:00 2010-06-03 17:00:00               480
3  2010-06-04 09:00:00 2010-06-04 09:05:00                 5
4  2016-02-13 09:00:00 2016-02-13 17:00:00               480
5  2016-02-14 09:00:00 2016-02-14 17:00:00               480
6  2016-02-15 09:00:00 2016-02-15 16:55:00               475
7  2018-03-18 09:00:00 2018-03-18 17:00:00               480
8  2018-03-19 09:00:00 2018-03-19 17:00:00               480
9  2018-03-20 09:00:00 2018-03-20 17:00:00               480
10 2011-05-23 09:00:00 2011-05-23 17:00:00               480
11 2011-05-24 09:00:00 2011-05-24 17:00:00               480
12 2011-05-25 09:00:00 2011-05-25 17:00:00               480

ORIGINAL ANSWER:

Here's a way to do what your question asks:

df = pd.concat([df.assign(durationInMinutes=df.durationInMinutes/3, 
    orig_row=i).reset_index() for i in range(3)])
for col in ['startTime', 'endTime']:
    df[col] = (pd.to_datetime(df[col], dayfirst=True).astype('int64') + 
        df.orig_row * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df = df.sort_values('index').drop(columns=['index', 'orig_row'])

Explanation:

overwrite the durationInMinutes column by dividing its contents by 3.
create 3 copies of df, each with a new column orig_row containing an integer corresponding to which number the copy is (0, 1 or 2).
for each of startTime and endTime, convert the string value to a datetime in nanoseconds and add the nanosecond equivalent of the number of days in orig_row to it (24 hours * 60 minutes * 60 seconds * 1bn nanoseconds).
sort by the original index (so consecutive days are grouped together) and drop the intermediate columns.

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00               1440
1  13-02-2016 09:00:00  14-02-2016 17:00:00               1440
2  18-03-2018 09:00:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

            startTime             endTime  durationInMinutes
0 2010-06-01 09:00:00 2010-06-02 17:00:00              480.0
0 2010-06-02 09:00:00 2010-06-03 17:00:00              480.0
0 2010-06-03 09:00:00 2010-06-04 17:00:00              480.0
1 2016-02-13 09:00:00 2016-02-14 17:00:00              480.0
1 2016-02-14 09:00:00 2016-02-15 17:00:00              480.0
1 2016-02-15 09:00:00 2016-02-16 17:00:00              480.0
2 2018-03-18 09:00:00 2018-03-19 17:00:00              480.0
2 2018-03-19 09:00:00 2018-03-20 17:00:00              480.0
2 2018-03-20 09:00:00 2018-03-21 17:00:00              480.0
3 2011-05-23 09:00:00 2011-05-24 17:00:00              480.0
3 2011-05-24 09:00:00 2011-05-25 17:00:00              480.0
3 2011-05-25 09:00:00 2011-05-26 17:00:00              480.0

TROUBLESHOOTING:

IN a comment, OP mentions getting ValueError: columns must have matching element counts in df.explode(). A print(df) just before the explode() line in the UPDATED solution in my environment gives:

             startTime              endTime   durationInMinutes  days     daysToAdd
0  01-06-2010 09:00:00  02-06-2010 17:00:00  [480, 480, 480, 5]     4  (0, 1, 2, 3)
1  13-02-2016 09:00:00  14-02-2016 17:00:00     [480, 480, 475]     3     (0, 1, 2)
2  18-03-2018 09:00:00  19-03-2018 17:00:00     [480, 480, 480]     3     (0, 1, 2)
3  23-05-2011 09:00:00  24-05-2011 17:00:00     [480, 480, 480]     3     (0, 1, 2)

Pandas dataframe create new rows based on condition from another column

Answers (2)

Related Questions