Reputation: 534
I have a huge datetime index
which is supposed to have 1 minute frequency. I know that there are periods of missing data. I would like to detect all missing data periods and find start and end dates for each of them. So far I figured out how to find missing timestamps:
fullrange = pd.date_range(start = obs.index.min(), end = obs.index.max(), freq = "1T")
missing_dates = obs.index.difference(fullrange)
Now I don't know how to separate missing_dates
into periods and find the start and end dates for them.
The obs.index
looks like this:
DatetimeIndex(['2020-05-10 09:08:00', '2020-05-10 09:09:00',
'2020-05-10 09:10:00', '2020-05-10 09:11:00',
'2020-05-10 09:12:00', '2020-05-10 09:13:00',
'2020-05-10 09:14:00', '2020-05-10 09:15:00',
'2020-05-10 09:16:00', '2020-05-10 12:24:00', # missing data
...
'2020-07-09 12:35:00', '2020-07-09 12:36:00',
'2020-07-09 12:37:00', '2020-07-09 12:38:00',
'2020-07-09 12:39:00', '2020-07-09 12:40:00',
'2020-07-09 12:41:00', '2020-07-09 12:42:00',
'2020-07-09 12:43:00', '2020-08-09 13:14:00'], # missing data
dtype='datetime64[ns]', name='timestamp', length=86617)
The expected result is a list of missing data periods, each period is a list with [start, end]:
[['2020-05-10 09:16:00', '2020-05-10 12:24:00'], ['2020-07-09 12:43:00', '2020-08-09 13:14:00']]
Upvotes: 2
Views: 633
Reputation: 862591
Use:
rng = pd.date_range('2017-04-03', periods=10, freq='T')
df = pd.DataFrame({'a': range(10)}, index=rng)
obs = df.iloc[[0,1,2,4,5,7,8,9]]
print (obs)
a
2017-04-03 00:00:00 0
2017-04-03 00:01:00 1
2017-04-03 00:02:00 2
2017-04-03 00:04:00 4
2017-04-03 00:05:00 5
2017-04-03 00:07:00 7
2017-04-03 00:08:00 8
2017-04-03 00:09:00 9
First are compared difference of all index values with omitted first value:
a = obs.index[obs.index.to_series().diff().ne(pd.Timedelta(1, 'T'))][1:]
Then get positions of this values by Index.get_indexer
:
pos = obs.index.get_indexer(a)
Convert output to strings:
idx = obs.index.astype(str)
And last use zip
with indexing previous values by subtract 1
with actual values in list comprehension
:
out = [list(x) for x in zip(idx[pos-1], idx[pos])]
print (out)
[['2017-04-03 00:02:00', '2017-04-03 00:04:00'],
['2017-04-03 00:05:00', '2017-04-03 00:07:00']]
Upvotes: 1