Reputation: 4482
Given the following
df = pd.DataFrame([[pd.to_datetime("2019-03-27 11:22:33"), pd.to_datetime("2019-03-28 16:54:55"), "A"],
[pd.to_datetime("2019-03-28 16:54:55"), pd.to_datetime("2019-03-29 19:54:55"), "B"],
[pd.to_datetime("2019-03-29 19:54:55"), pd.to_datetime("2019-03-29 19:55:55"), "B"],
[pd.to_datetime("2019-03-29 19:55:55"), pd.to_datetime("2019-03-29 20:55:55"), "A"]], columns = ["start", "end", "type"])
start | end | type
0 2019-03-27 11:22:33 | 2019-03-28 16:54:55 | A
1 2019-03-28 16:54:55 | 2019-03-29 19:54:55 | B
2 2019-03-29 19:54:55 | 2019-03-29 19:55:55 | B
3 2019-03-29 19:55:55 | 2019-03-29 20:55:55 | A
How can I un-group them per calendar day so that events which span multiple days get separate rows, e.g. so the above looks like this?
start | end | type
0 2019-03-27 11:22:33 | 2019-03-28 00:00:00 | A
1 2019-03-28 00:00:00 | 2019-03-28 16:54:55 | A
2 2019-03-28 16:54:55 | 2019-03-29 00:00:00 | B
3 2019-03-29 00:00:00 | 2019-03-29 19:54:55 | B
4 2019-03-29 19:54:55 | 2019-03-29 19:55:55 | B
5 2019-03-29 19:55:55 | 2019-03-29 20:55:55 | A
I can't think of a neat way to do it.
Upvotes: 1
Views: 63
Reputation: 4482
df = pd.DataFrame([[pd.to_datetime("2019-02-25 00:00:00"), pd.to_datetime("2019-02-26 00:00:00"), "C"],
[pd.to_datetime("2019-02-26 00:00:00"), pd.to_datetime("2019-02-27 00:00:00"), "B"],
[pd.to_datetime("2019-02-27 00:00:00"), pd.to_datetime("2019-02-27 11:15:33"), "B"],
[pd.to_datetime("2019-02-27 11:15:33"), pd.to_datetime("2019-02-27 11:22:33"), "B"],
[pd.to_datetime("2019-03-27 11:22:33"), pd.to_datetime("2019-03-28 16:54:55"), "A"],
[pd.to_datetime("2019-03-28 16:54:55"), pd.to_datetime("2019-03-29 19:54:55"), "B"],
[pd.to_datetime("2019-03-29 19:54:55"), pd.to_datetime("2019-03-29 19:55:55"), "B"],
[pd.to_datetime("2019-03-29 19:55:55"), pd.to_datetime("2019-03-29 20:55:55"), "A"],
[pd.to_datetime("2019-03-29 20:55:55"), pd.to_datetime("2019-04-01 04:23:01"), "A"]], columns = ["start", "end", "type"])
new_df = pd.DataFrame()
for index, row in df.iterrows():
start, end = row[["start", "end"]]
start_date, end_date = [pd.to_datetime(x.date()) for x in [start, end]]
# whole event encapsulated in one calendar day
# or event duration 24h and starts is midnight
# append row as is
if (start_date == end_date) or ((end_date - start_date) == pd.Timedelta(days = 1)):
new_df = new_df.append([row])
continue
# start -> subsequent day at midnight
row["start"] = start
row["end"] = start_date + pd.Timedelta(days = 1)
new_df = new_df.append(row)
# midnight -> mignight subsequent day iff spans > 2 days
for i in range((end_date - start_date).days - 1):
row["start"] = start_date + pd.Timedelta(days = i + 1)
row["end"] = start_date + pd.Timedelta(days = i + 2)
new_df = new_df.append(row)
#end_date at midnight -> end
row["start"] = end_date
row["end"] = end
new_df = new_df.append(row)
new_df = new_df.reset_index(drop = True)
new_df = new_df[df.columns]
start end type
0 2019-02-25 00:00:00 2019-02-26 00:00:00 C
1 2019-02-26 00:00:00 2019-02-27 00:00:00 B
2 2019-02-27 00:00:00 2019-02-27 11:15:33 B
3 2019-02-27 11:15:33 2019-02-27 11:22:33 B
4 2019-03-27 11:22:33 2019-03-28 16:54:55 A
5 2019-03-28 16:54:55 2019-03-29 19:54:55 B
6 2019-03-29 19:54:55 2019-03-29 19:55:55 B
7 2019-03-29 19:55:55 2019-03-29 20:55:55 A
8 2019-03-29 20:55:55 2019-03-30 00:00:00 A
9 2019-03-30 00:00:00 2019-03-31 00:00:00 A
10 2019-03-31 00:00:00 2019-04-01 00:00:00 A
11 2019-04-01 00:00:00 2019-04-01 04:23:01 A
Upvotes: 1
Reputation: 416
Or you can use the following solution:
Make a Series
of normalized dates
ds = pd.Series(pd.date_range(df.start.iloc[0], df.end.iloc[-1],
freq='D', normalize=True,
closed='right'))
Create a new DataFrame
from Series
and set end
column as index, since it will determine the 'type' column in the next step
foo_df = pd.DataFrame([sorted(df.start.append(ds)), sorted(df.end.append(ds))],
index=['start', 'end']).T.set_index(['end'])
Join on index type
column from the initial DataFrame
, re-index and reorder columns
foo_df = (foo_df.join(df.set_index('end')['type']).fillna(method='bfill')
.reset_index()[['start', 'end', 'type']])
yields
start end type
0 2019-03-27 11:15:33 2019-03-27 11:22:33 B
1 2019-03-27 11:22:33 2019-03-28 00:00:00 A
2 2019-03-28 00:00:00 2019-03-28 16:54:55 A
3 2019-03-28 16:54:55 2019-03-29 00:00:00 B
4 2019-03-29 00:00:00 2019-03-29 19:54:55 B
5 2019-03-29 19:54:55 2019-03-29 19:55:55 B
6 2019-03-29 19:55:55 2019-03-29 20:55:55 A
7 2019-03-29 20:55:55 2019-03-30 00:00:00 A
8 2019-03-30 00:00:00 2019-03-31 00:00:00 A
9 2019-03-31 00:00:00 2019-04-01 00:00:00 A
10 2019-04-01 00:00:00 2019-04-01 04:23:01 A
Upvotes: 3