elnino
elnino

Reputation: 173

Pandas dataframe create new rows based on condition from another column

I have dataframe:

data = {'startTime':['01-06-2010 09:00:00', '13-02-2016 09:00:00', '18-03-2018 09:00:00', '23-05-2011 09:00:00'], 'endTime':['02-06-2010 17:00:00', '14-02-2016 17:00:00', '19-03-2018 17:00:00', '24-05-2011 17:00:00'], 'durationInMinutes': [1440, 1440, 1440, 1440]}

df = pd.DataFrame(data)

I want to make additional rows by dividing 1440 minutes into equal intervals of 8 hours per day. So, 1440 minutes would be equal to 3 days (3 additional rows) (9am - 5pm). Minutes can be more than 1440. New rows in startTime and endTime would be:

startTime               endTime

01-06-2010 09:00:00     01-06-2010 17:00:00
02-06-2010 09:00:00     02-06-2010 17:00:00
03-06-2010 09:00:00     03-06-2010 17:00:00
13-02-2016 09:00:00     13-02-2016 17:00:00
14-02-2016 09:00:00     14-02-2016 17:00:00
15-02-2016 09:00:00     15-02-2016 17:00:00

Can anyone please help me with this? Thanks.

Upvotes: 0

Views: 147

Answers (2)

constantstranger
constantstranger

Reputation: 9379

UPDATE #2:

Here is a way to do what the question asks, as clarified by multiple OP comments, which is to:

  • for each existing row, create enough rows in the result to accommodate consecutive days with periods of 8 hours (or, for the final day, up to 8 hours) such that the aggregate duration of this group of result rows matches the durationInMinutes column value in the input.
  • set durationInMinutes for each result row to match the above logic.
  • ensure that the startTime of each result row is 09:00 on the appropriate date.
  • ensure that the endTime of each result row is a period of durationInMinutes later than startTime to match the above logic.
df['days'] = df.durationInMinutes // (8 * 60) + (df.durationInMinutes % (8 * 60) > 0)
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1) + 
    [x.durationInMinutes % (8 * 60) + (x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = pd.to_datetime(df.startTime, dayfirst=True)
df.startTime = pd.to_datetime(pd.DataFrame({
    'year':df.startTime.dt.year, 'month':df.startTime.dt.month, 
    'day':df.startTime.dt.day + df.daysToAdd, 
    'hour':[9]*len(df.index)}))
df.endTime = (df.startTime.astype('int64') + 
        df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00                475
1  13-02-2016 08:30:00  14-02-2016 17:00:00                510
2  18-03-2018 09:30:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

            startTime             endTime durationInMinutes
0 2010-06-01 09:00:00 2010-06-01 16:55:00               475
1 2016-02-13 09:00:00 2016-02-13 17:00:00               480
2 2016-02-14 09:00:00 2016-02-14 09:30:00                30
3 2018-03-18 09:00:00 2018-03-18 17:00:00               480
4 2018-03-19 09:00:00 2018-03-19 17:00:00               480
5 2018-03-20 09:00:00 2018-03-20 17:00:00               480
6 2011-05-23 09:00:00 2011-05-23 17:00:00               480
7 2011-05-24 09:00:00 2011-05-24 17:00:00               480
8 2011-05-25 09:00:00 2011-05-25 17:00:00               480

UPDATED: Here is a way to do what you have asked in your question and clarified in your comments:

df['days'] = df.durationInMinutes // (8 * 60) + 
    (df.durationInMinutes % (8 * 60) > 0).astype(int)
df['durationInMinutes'] = df.apply(lambda x: [8 * 60] * (x.days - 1) + 
    [x.durationInMinutes % (8 * 60) + 
    (x.durationInMinutes % (8 * 60) == 0) * 8 * 60], axis=1)
df['daysToAdd'] = df.days.apply(lambda x: range(x))
df = df.explode(['durationInMinutes', 'daysToAdd'])
df.startTime = (pd.to_datetime(df.startTime, dayfirst=True).astype('int64') + 
        df.daysToAdd * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df.endTime = (df.startTime.astype('int64') + 
        df.durationInMinutes * 60*1_000_000_000).astype('datetime64[ns]')
df = df.drop(columns=['days', 'daysToAdd']).reset_index(drop=True)

Explanation:

  • add column days with number of result rows corresponding to each input row.
  • update durationInMinutes column to contain a list for each row of the minutes in each result row originating from the input row.
  • add column daysToAdd with a list for each row of the number of days to add to the startTime for each result row originating from the input row.
  • use explode() to create result rows with one value each from the lists in durationInMinutes and daysToAdd.
  • add the nanosecond equivalent of daysToAdd to startTime.
  • update endTime to be startTime plus the nanosecond equivalent of durationInMinutes.
  • use drop() to eliminate unneeded columns, and use reset_index() to get an integer range index starting at 0 that advances by 1 for each row.

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00               1445
1  13-02-2016 09:00:00  14-02-2016 17:00:00               1435
2  18-03-2018 09:00:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

             startTime             endTime durationInMinutes
0  2010-06-01 09:00:00 2010-06-01 17:00:00               480
1  2010-06-02 09:00:00 2010-06-02 17:00:00               480
2  2010-06-03 09:00:00 2010-06-03 17:00:00               480
3  2010-06-04 09:00:00 2010-06-04 09:05:00                 5
4  2016-02-13 09:00:00 2016-02-13 17:00:00               480
5  2016-02-14 09:00:00 2016-02-14 17:00:00               480
6  2016-02-15 09:00:00 2016-02-15 16:55:00               475
7  2018-03-18 09:00:00 2018-03-18 17:00:00               480
8  2018-03-19 09:00:00 2018-03-19 17:00:00               480
9  2018-03-20 09:00:00 2018-03-20 17:00:00               480
10 2011-05-23 09:00:00 2011-05-23 17:00:00               480
11 2011-05-24 09:00:00 2011-05-24 17:00:00               480
12 2011-05-25 09:00:00 2011-05-25 17:00:00               480

ORIGINAL ANSWER:

Here's a way to do what your question asks:

df = pd.concat([df.assign(durationInMinutes=df.durationInMinutes/3, 
    orig_row=i).reset_index() for i in range(3)])
for col in ['startTime', 'endTime']:
    df[col] = (pd.to_datetime(df[col], dayfirst=True).astype('int64') + 
        df.orig_row * 24*60*60*1_000_000_000).astype('datetime64[ns]')
df = df.sort_values('index').drop(columns=['index', 'orig_row'])

Explanation:

  • overwrite the durationInMinutes column by dividing its contents by 3.
  • create 3 copies of df, each with a new column orig_row containing an integer corresponding to which number the copy is (0, 1 or 2).
  • for each of startTime and endTime, convert the string value to a datetime in nanoseconds and add the nanosecond equivalent of the number of days in orig_row to it (24 hours * 60 minutes * 60 seconds * 1bn nanoseconds).
  • sort by the original index (so consecutive days are grouped together) and drop the intermediate columns.

Input:

             startTime              endTime  durationInMinutes
0  01-06-2010 09:00:00  02-06-2010 17:00:00               1440
1  13-02-2016 09:00:00  14-02-2016 17:00:00               1440
2  18-03-2018 09:00:00  19-03-2018 17:00:00               1440
3  23-05-2011 09:00:00  24-05-2011 17:00:00               1440

Output:

            startTime             endTime  durationInMinutes
0 2010-06-01 09:00:00 2010-06-02 17:00:00              480.0
0 2010-06-02 09:00:00 2010-06-03 17:00:00              480.0
0 2010-06-03 09:00:00 2010-06-04 17:00:00              480.0
1 2016-02-13 09:00:00 2016-02-14 17:00:00              480.0
1 2016-02-14 09:00:00 2016-02-15 17:00:00              480.0
1 2016-02-15 09:00:00 2016-02-16 17:00:00              480.0
2 2018-03-18 09:00:00 2018-03-19 17:00:00              480.0
2 2018-03-19 09:00:00 2018-03-20 17:00:00              480.0
2 2018-03-20 09:00:00 2018-03-21 17:00:00              480.0
3 2011-05-23 09:00:00 2011-05-24 17:00:00              480.0
3 2011-05-24 09:00:00 2011-05-25 17:00:00              480.0
3 2011-05-25 09:00:00 2011-05-26 17:00:00              480.0

TROUBLESHOOTING:

IN a comment, OP mentions getting ValueError: columns must have matching element counts in df.explode(). A print(df) just before the explode() line in the UPDATED solution in my environment gives:

             startTime              endTime   durationInMinutes  days     daysToAdd
0  01-06-2010 09:00:00  02-06-2010 17:00:00  [480, 480, 480, 5]     4  (0, 1, 2, 3)
1  13-02-2016 09:00:00  14-02-2016 17:00:00     [480, 480, 475]     3     (0, 1, 2)
2  18-03-2018 09:00:00  19-03-2018 17:00:00     [480, 480, 480]     3     (0, 1, 2)
3  23-05-2011 09:00:00  24-05-2011 17:00:00     [480, 480, 480]     3     (0, 1, 2)

Upvotes: 1

mozway
mozway

Reputation: 260640

IIUC, you can use:

# ensure datetime
df[['startTime', 'endTime']] = df[['startTime', 'endTime']].apply(pd.to_datetime, dayfirst=True)

# compute number of rows in days
extra = np.ceil(df['durationInMinutes'].div(60*8)).astype(int)
# compute a shift (+0, +1, +2days etc.)
shift = extra.repeat(extra).groupby(level=0).cumcount().mul(pd.Timedelta('1day'))

# duplicate the rows
df2 = df.loc[df.index.repeat(extra)].reset_index(drop=True)

# add the shift
df2[['startTime', 'endTime']] = df2[['startTime', 'endTime']].add(shift.values, axis=0)

print(df2)

output:

             startTime             endTime  durationInMinutes
0  2010-06-01 09:00:00 2010-06-02 17:00:00               1440
1  2010-06-02 09:00:00 2010-06-03 17:00:00               1440
2  2010-06-03 09:00:00 2010-06-04 17:00:00               1440
3  2016-02-13 09:00:00 2016-02-14 17:00:00               1440
4  2016-02-14 09:00:00 2016-02-15 17:00:00               1440
5  2016-02-15 09:00:00 2016-02-16 17:00:00               1440
6  2018-03-18 09:00:00 2018-03-19 17:00:00               1440
7  2018-03-19 09:00:00 2018-03-20 17:00:00               1440
8  2018-03-20 09:00:00 2018-03-21 17:00:00               1440
9  2011-05-23 09:00:00 2011-05-24 17:00:00               1440
10 2011-05-24 09:00:00 2011-05-25 17:00:00               1440
11 2011-05-25 09:00:00 2011-05-26 17:00:00               1440

Upvotes: 1

Related Questions