Mardoxx

Reputation: 4482

Ungroup events described by start and end times

Given the following

df = pd.DataFrame([[pd.to_datetime("2019-03-27 11:22:33"), pd.to_datetime("2019-03-28 16:54:55"), "A"],
                  [pd.to_datetime("2019-03-28 16:54:55"), pd.to_datetime("2019-03-29 19:54:55"), "B"],
                  [pd.to_datetime("2019-03-29 19:54:55"), pd.to_datetime("2019-03-29 19:55:55"), "B"],
                  [pd.to_datetime("2019-03-29 19:55:55"), pd.to_datetime("2019-03-29 20:55:55"), "A"]], columns = ["start", "end", "type"])
start                 | end                   | type
0 2019-03-27 11:22:33 | 2019-03-28 16:54:55   | A    
1 2019-03-28 16:54:55 | 2019-03-29 19:54:55   | B    
2 2019-03-29 19:54:55 | 2019-03-29 19:55:55   | B    
3 2019-03-29 19:55:55 | 2019-03-29 20:55:55   | A    

How can I un-group them per calendar day, so that an event which spans multiple days gets a separate row for each day it touches, e.g. so that the above becomes the following?

start                 | end                   | type          
0 2019-03-27 11:22:33 | 2019-03-28 00:00:00   | A    
1 2019-03-28 00:00:00 | 2019-03-28 16:54:55   | A    
2 2019-03-28 16:54:55 | 2019-03-29 00:00:00   | B    
3 2019-03-29 00:00:00 | 2019-03-29 19:54:55   | B    
4 2019-03-29 19:54:55 | 2019-03-29 19:55:55   | B    
5 2019-03-29 19:55:55 | 2019-03-29 20:55:55   | A    

I can't think of a neat way to do it.

Upvotes: 1

Views: 63

Answers (2)

Mardoxx

Reputation: 4482

df = pd.DataFrame([[pd.to_datetime("2019-02-25 00:00:00"), pd.to_datetime("2019-02-26 00:00:00"), "C"],
                  [pd.to_datetime("2019-02-26 00:00:00"), pd.to_datetime("2019-02-27 00:00:00"), "B"],
                  [pd.to_datetime("2019-02-27 00:00:00"), pd.to_datetime("2019-02-27 11:15:33"), "B"],
                  [pd.to_datetime("2019-02-27 11:15:33"), pd.to_datetime("2019-02-27 11:22:33"), "B"],
                  [pd.to_datetime("2019-03-27 11:22:33"), pd.to_datetime("2019-03-28 16:54:55"), "A"],
                  [pd.to_datetime("2019-03-28 16:54:55"), pd.to_datetime("2019-03-29 19:54:55"), "B"],
                  [pd.to_datetime("2019-03-29 19:54:55"), pd.to_datetime("2019-03-29 19:55:55"), "B"],
                  [pd.to_datetime("2019-03-29 19:55:55"), pd.to_datetime("2019-03-29 20:55:55"), "A"],
                  [pd.to_datetime("2019-03-29 20:55:55"), pd.to_datetime("2019-04-01 04:23:01"), "A"]], columns = ["start", "end", "type"])
rows = []

for _, row in df.iterrows():
    start, end = row["start"], row["end"]
    start_date, end_date = start.normalize(), end.normalize()

    # event does not cross a midnight (it ends at or before the midnight
    # following its start), which also covers exact midnight-to-midnight
    # days -> append the row as is
    if end <= start_date + pd.Timedelta(days=1):
        rows.append(row.copy())
        continue

    # first segment: start -> midnight of the following day
    row["start"] = start
    row["end"] = start_date + pd.Timedelta(days=1)
    rows.append(row.copy())

    # full midnight-to-midnight days in between
    # (only when the event spans more than two calendar days)
    for i in range((end_date - start_date).days - 1):
        row["start"] = start_date + pd.Timedelta(days=i + 1)
        row["end"] = start_date + pd.Timedelta(days=i + 2)
        rows.append(row.copy())

    # last segment: midnight of the end date -> end
    row["start"] = end_date
    row["end"] = end
    rows.append(row.copy())

# build the result in one go (DataFrame.append was removed in pandas 2.0)
new_df = pd.DataFrame(rows).reset_index(drop=True)
new_df = new_df[df.columns]
which gives

                 start                 end type
0  2019-02-25 00:00:00 2019-02-26 00:00:00    C
1  2019-02-26 00:00:00 2019-02-27 00:00:00    B
2  2019-02-27 00:00:00 2019-02-27 11:15:33    B
3  2019-02-27 11:15:33 2019-02-27 11:22:33    B
4  2019-03-27 11:22:33 2019-03-28 00:00:00    A
5  2019-03-28 00:00:00 2019-03-28 16:54:55    A
6  2019-03-28 16:54:55 2019-03-29 00:00:00    B
7  2019-03-29 00:00:00 2019-03-29 19:54:55    B
8  2019-03-29 19:54:55 2019-03-29 19:55:55    B
9  2019-03-29 19:55:55 2019-03-29 20:55:55    A
10 2019-03-29 20:55:55 2019-03-30 00:00:00    A
11 2019-03-30 00:00:00 2019-03-31 00:00:00    A
12 2019-03-31 00:00:00 2019-04-01 00:00:00    A
13 2019-04-01 00:00:00 2019-04-01 04:23:01    A
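
As a quick sanity check (a sketch, not part of the original answer; it only relies on df and new_df as defined above), the total duration per type should be unchanged by the splitting:

# hypothetical check: splitting must preserve the per-type total duration
orig_totals = (df.end - df.start).groupby(df.type).sum()
new_totals = (pd.to_datetime(new_df.end) - pd.to_datetime(new_df.start)).groupby(new_df.type).sum()
assert orig_totals.equals(new_totals)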

Upvotes: 1

riffle_shuffle

Reputation: 416

Alternatively, you can use the following approach:

Make a Series of normalized dates (the midnights lying between the first start and the last end)

ds = pd.Series(pd.date_range(df.start.iloc[0], df.end.iloc[-1],
                             freq='D', normalize=True,
                             inclusive='right'))  # closed='right' on pandas < 1.4
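
For intuition, here is the date_range trick on its own (a sketch using the question's first start and last end, which may differ from the exact input this answer was run against):

pd.date_range("2019-03-27 11:22:33", "2019-03-29 20:55:55",
              freq='D', normalize=True, inclusive='right')
# -> the interior midnights 2019-03-28 and 2019-03-29, i.e. exactly the
#    points at which multi-day events need to be cut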

Create a new DataFrame from the sorted starts and ends, and set the end column as the index, since it determines the 'type' of each segment in the next step

foo_df = pd.DataFrame([sorted(pd.concat([df.start, ds])),  # Series.append was removed in pandas 2.0
                       sorted(pd.concat([df.end, ds]))],
                      index=['start', 'end']).T.set_index(['end'])
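
To see why the sort-and-pair step works, here is an informal trace assuming the question's original four-row df (again, possibly not the exact input used for the output below): once the two interior midnights are added to both columns and each column is sorted, every start lines up with the boundary that closes its segment

# starts (sorted): 03-27 11:22:33, 03-28 00:00, 03-28 16:54:55, 03-29 00:00, 03-29 19:54:55, 03-29 19:55:55
# ends   (sorted): 03-28 00:00, 03-28 16:54:55, 03-29 00:00, 03-29 19:54:55, 03-29 19:55:55, 03-29 20:55:55
# zipping the two columns row-wise gives exactly the per-day segments the question asks for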

Join the 'type' column from the initial DataFrame on the index, backfill the missing types, then reset the index and reorder the columns

foo_df = (foo_df.join(df.set_index('end')['type'])
                .bfill()  # fillna(method='bfill') on older pandas
                .reset_index()[['start', 'end', 'type']])

yields

                 start                 end type
0  2019-03-27 11:15:33 2019-03-27 11:22:33    B
1  2019-03-27 11:22:33 2019-03-28 00:00:00    A
2  2019-03-28 00:00:00 2019-03-28 16:54:55    A
3  2019-03-28 16:54:55 2019-03-29 00:00:00    B
4  2019-03-29 00:00:00 2019-03-29 19:54:55    B
5  2019-03-29 19:54:55 2019-03-29 19:55:55    B
6  2019-03-29 19:55:55 2019-03-29 20:55:55    A
7  2019-03-29 20:55:55 2019-03-30 00:00:00    A
8  2019-03-30 00:00:00 2019-03-31 00:00:00    A
9  2019-03-31 00:00:00 2019-04-01 00:00:00    A
10 2019-04-01 00:00:00 2019-04-01 04:23:01    A
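
As a quick check of the reshaped frame (a hypothetical snippet, not part of the original answer): consecutive segments should tile the timeline, i.e. each row's start equals the previous row's end:

# hypothetical sanity check: the segments are contiguous
assert (foo_df['start'].iloc[1:].values == foo_df['end'].iloc[:-1].values).all()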

Upvotes: 3
