Reputation:
I would like to analyse time series data, where I have some millions of entries. The data has a granularity of one data entry per minute. During the weekend, per definition no data exists. As well as for one hour during a weekday.
I want to check for missing data during the week (so: if one or more minutes are missing).
How would I do this with high performance in Python (e.g. with a Pandas DataFrame
)
Upvotes: 5
Views: 3801
Reputation: 1357
Probably the easiest would be to compare your DatetimeIndex
with missing values to a reference DatetimeIndex
covering the same range with all values.
Here's an example where I create an arbitrary DatetimeIndex
and include some dummy values in a DataFrame
.
import pandas as pd
import numpy as np
#dummy data
date_range = pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1Min')
df = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 1)))
df.index = date_range # set index
df_missing = df.drop(df.between_time('00:12', '00:14').index)
#check for missing datetimeindex values based on reference index (with all values)
missing_dates = df.index[~df.index.isin(df_missing.index)]
print(missing_dates)
Which will return:
DatetimeIndex(['2017-01-01 00:12:00', '2017-01-01 00:13:00',
'2017-01-01 00:14:00'],
dtype='datetime64[ns]', freq='T')
Upvotes: 7