Reputation: 5347
I have a pandas
dataframe
, with Start
and End
datatime.
df=pd.DataFrame(data=pd.date_range('20100201', periods=10, freq='5h3min'),columns=['Start'])
df.loc[:,'End']=df.loc[:,'Start']+pd.Timedelta(4,'h')
Start
and End
can be expected to be sorted interally, but gaps/overlaps may occur between consecutive rows.
I would like to create a new dataframe with the difference that if row contains midnight (e.g. midnight is contained in [Start
,End
]), the row is then split in two parts before and after midnight
ex:
Start End
0 2010-02-01 00:00:00 2010-02-01 04:00:00
1 2010-02-01 05:03:00 2010-02-01 09:03:00
2 2010-02-01 10:06:00 2010-02-01 14:06:00
3 2010-02-01 15:09:00 2010-02-01 19:09:00
4 2010-02-01 20:12:00 2010-02-02 00:12:00
5 2010-02-02 01:15:00 2010-02-02 05:15:00
should be
Start End
0 2010-02-01 00:00:00 2010-02-01 04:00:00
1 2010-02-01 05:03:00 2010-02-01 09:03:00
2 2010-02-01 10:06:00 2010-02-01 14:06:00
3 2010-02-01 15:09:00 2010-02-01 19:09:00
-----------------------------------------
4 2010-02-01 20:12:00 2010-02-01 23:59:00
5 2010-02-02 00:00:00 2010-02-02 00:12:00
-----------------------------------------
6 2010-02-02 01:15:00 2010-02-02 05:15:00
Upvotes: 4
Views: 480
Reputation: 11
I don't believe the above answer works when a midnight time occurs early in the list. Someone correct me if I'm wrong, but I believe anytime you drop the indexes of the "splits" it is then dropping too much from the original list then as well.
I recognize this isn't answering the above question, but I don't have the reputation to comment above. In my case, I believe I will likely just convert to a numpy array, insert rows where the midnight points are, and then copy data accordingly. Ugly, but should work.
Upvotes: 1
Reputation: 76346
You can concat the DataFrame of new pairs, then erase the old ones.
First find the splits:
splits = df[df.End.dt.date > df.Start.dt.date].copy()
Now concatenate and drop:
>>> pd.concat([
df,
pd.DataFrame({
'Start': list(splits.Start) + list(splits.End.dt.floor(freq='1D')),
'End': list(splits.Start.dt.ceil(freq='1D')) + list(splits.End)})
]).drop(splits.index).sort_values(by='Start')
End Start
0 2010-02-01 04:00:00 2010-02-01 00:00:00
1 2010-02-01 09:03:00 2010-02-01 05:03:00
2 2010-02-01 14:06:00 2010-02-01 10:06:00
3 2010-02-01 19:09:00 2010-02-01 15:09:00
0 2010-02-02 00:00:00 2010-02-01 20:12:00
2 2010-02-02 00:12:00 2010-02-02 00:00:00
5 2010-02-02 05:15:00 2010-02-02 01:15:00
6 2010-02-02 10:18:00 2010-02-02 06:18:00
7 2010-02-02 15:21:00 2010-02-02 11:21:00
8 2010-02-02 20:24:00 2010-02-02 16:24:00
1 2010-02-03 00:00:00 2010-02-02 21:27:00
3 2010-02-03 01:27:00 2010-02-03 00:00:00
Upvotes: 4