Reputation: 11
I am trying to write a Python implementation of the GAPS operator from Kairos.
The GAPS operator introduces NaN values wherever there is no data point, based on the sampling frequency. Here is an example on a sample dataset:
import pandas as pd
# Sample DataFrame
data = {'timestamp': ['2023-07-20 00:01:30', '2023-07-20 01:50:10', '2023-07-20 01:40:00', '2023-07-20 03:00:00'],
'value': [10, 20, 30, 15]}
df = pd.DataFrame(data)
# Convert the 'timestamp' column to a pandas DateTimeIndex
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.sort_values('timestamp', inplace=True)
df.set_index('timestamp', inplace=True)
# Create a new DatetimeIndex with 1-hour frequency
# start_time = df.index.min().floor('H') # Round down to the nearest hour
# end_time = df.index.max().ceil('H') # Round up to the nearest hour
start_time = '2023-07-20 00:00:00'
end_time = '2023-07-20 05:00:00'
new_index = pd.date_range(start=start_time, end=end_time, freq='1H')
# Reindex the DataFrame on the union of both indexes; the added timestamps get NaN
df_reindexed = df.reindex(df.index.union(new_index))
print(df_reindexed)
What I got is:
value
2023-07-20 00:00:00 NaN
2023-07-20 00:01:30 10.0
2023-07-20 01:00:00 NaN
2023-07-20 01:40:00 30.0
2023-07-20 01:50:10 20.0
2023-07-20 02:00:00 NaN
2023-07-20 03:00:00 15.0
2023-07-20 04:00:00 NaN
2023-07-20 05:00:00 NaN
What I should get is:
value
2023-07-20 00:00:00 NaN
2023-07-20 00:01:30 10.0
2023-07-20 01:40:00 30.0
2023-07-20 01:50:10 20.0
2023-07-20 02:00:00 NaN
2023-07-20 03:00:00 15.0
2023-07-20 04:00:00 NaN
2023-07-20 05:00:00 NaN
The row '2023-07-20 01:00:00 NaN' should not be present in the result: the sampling frequency is 1 hour, and that hour already contains 2 data points (2023-07-20 01:40:00 and 2023-07-20 01:50:10).
Any help or reference would be appreciated. Cheers.
Upvotes: 1
Views: 28
Reputation: 262484
I would post-process your output with boolean indexing to remove the unwanted rows:
# rows whose timestamp comes from the original data, not from the new hourly index
m1 = ~df_reindexed.index.isin(new_index)
# rows that hold an actual value (not NaN)
m2 = df_reindexed['value'].notna()
# rows in an hourly bin that contains fewer than 2 data points
m3 = (df_reindexed.groupby(pd.Grouper(level=0, freq='1H'))
      ['value'].transform('count').lt(2)
     )
# if any condition is True, keep the row
out = df_reindexed[m1|m2|m3]
NB. I'm not sure if m1 is really needed; you can remove it if not.
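For reuse, the same filtering can be wrapped in a small function. This is only a sketch: the name insert_gaps and its parameters are hypothetical, it assumes the input has a unique DatetimeIndex and no NaN values of its own (which is why m1 is left out here), and it uses the lowercase '1h' alias because recent pandas versions deprecate the uppercase 'H'.

```python
import pandas as pd


def insert_gaps(df, start, end, freq="1h", column="value"):
    """Hypothetical helper: add NaN rows on a regular grid, then drop the
    grid rows whose bin already holds at least two real data points."""
    grid = pd.date_range(start=start, end=end, freq=freq)
    # The union keeps the original timestamps and adds the grid timestamps
    out = df.reindex(df.index.union(grid))
    # Rows that carry an actual value are always kept
    has_value = out[column].notna()
    # Number of non-NaN values in the bin each row belongs to
    per_bin = out.groupby(pd.Grouper(level=0, freq=freq))[column].transform("count")
    return out[has_value | per_bin.lt(2)]


# Sample data from the question
df = pd.DataFrame(
    {"value": [10, 20, 30, 15]},
    index=pd.to_datetime(
        ["2023-07-20 00:01:30", "2023-07-20 01:50:10",
         "2023-07-20 01:40:00", "2023-07-20 03:00:00"]
    ),
).sort_index()

result = insert_gaps(df, "2023-07-20 00:00:00", "2023-07-20 05:00:00")
print(result)
```

The threshold of 2 is taken directly from the masks above; if your rule is different (for example, dropping the gap row as soon as the bin contains any value), only the per_bin comparison needs to change.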
Output:
value
2023-07-20 00:00:00 NaN
2023-07-20 00:01:30 10.0
2023-07-20 01:40:00 30.0
2023-07-20 01:50:10 20.0
2023-07-20 02:00:00 NaN
2023-07-20 03:00:00 15.0
2023-07-20 04:00:00 NaN
2023-07-20 05:00:00 NaN
Intermediates:
                     value  ngroup  count     m1     m2     m3  m1|m2|m3
2023-07-20 00:00:00    NaN       0      1  False  False   True      True
2023-07-20 00:01:30   10.0       0      1   True   True   True      True
2023-07-20 01:00:00    NaN       1      2  False  False  False     False
2023-07-20 01:40:00   30.0       1      2   True   True  False      True
2023-07-20 01:50:10   20.0       1      2   True   True  False      True
2023-07-20 02:00:00    NaN       2      0  False  False   True      True
2023-07-20 03:00:00   15.0       3      1  False   True   True      True
2023-07-20 04:00:00    NaN       4      0  False  False   True      True
2023-07-20 05:00:00    NaN       5      0  False  False   True      True
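If you want to rebuild these intermediate columns yourself to check the logic on other data, something like the following should work. It is a rough sketch that recreates the sample data from the question; the names debug and keep are illustrative only, and '1h' is used instead of '1H' to avoid the deprecation warning in recent pandas versions.

```python
import pandas as pd

# Sample data from the question, already sorted by timestamp
df = pd.DataFrame(
    {"value": [10, 30, 20, 15]},
    index=pd.to_datetime(
        ["2023-07-20 00:01:30", "2023-07-20 01:40:00",
         "2023-07-20 01:50:10", "2023-07-20 03:00:00"]
    ),
)
new_index = pd.date_range("2023-07-20 00:00:00", "2023-07-20 05:00:00", freq="1h")
df_reindexed = df.reindex(df.index.union(new_index))

# One group per hourly bin
groups = df_reindexed.groupby(pd.Grouper(level=0, freq="1h"))

debug = df_reindexed.assign(
    ngroup=groups.ngroup(),                    # number of the hourly bin
    count=groups["value"].transform("count"),  # non-NaN values in that bin
    m1=~df_reindexed.index.isin(new_index),    # timestamp only in the original data
    m2=df_reindexed["value"].notna(),          # row holds a real value
)
debug["m3"] = debug["count"].lt(2)             # bin has fewer than 2 data points
debug["keep"] = debug["m1"] | debug["m2"] | debug["m3"]
print(debug)
```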
Upvotes: 0