nkstack

Reputation: 11

Need to have non NaN value within same sampling window in a DataFrame

I am trying to write an implementation of the GAPS operator in Kairos in Python.

The GAPS operator introduces NaN values wherever there is no data point, based on the sampling frequency. Here is an example on a sample dataset:

import pandas as pd

# Sample DataFrame
data = {'timestamp': ['2023-07-20 00:01:30', '2023-07-20 01:50:10', '2023-07-20 01:40:00', '2023-07-20 03:00:00'],
        'value': [10, 20, 30, 15]}
df = pd.DataFrame(data)

# Convert the 'timestamp' column to a pandas DateTimeIndex
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.sort_values('timestamp', inplace=True)
df.set_index('timestamp', inplace=True)


# # Create a new DatetimeIndex with 1-hour frequency
# start_time = df.index.min().floor('H')  # Round down to the nearest hour
# end_time = df.index.max().ceil('H')     # Round up to the nearest hour

start_time = '2023-07-20 00:00:00'
end_time = '2023-07-20 05:00:00'

new_index = pd.date_range(start=start_time, end=end_time, freq='1H')


# Reindex on the union of the original and hourly indexes; the added timestamps get NaN
df_reindexed = df.reindex(df.index.union(new_index))

print(df_reindexed)

What I got is:
                     value
2023-07-20 00:00:00    NaN
2023-07-20 00:01:30   10.0
2023-07-20 01:00:00    NaN
2023-07-20 01:40:00   30.0
2023-07-20 01:50:10   20.0
2023-07-20 02:00:00    NaN
2023-07-20 03:00:00   15.0
2023-07-20 04:00:00    NaN
2023-07-20 05:00:00    NaN

What I should get is:

                     value
2023-07-20 00:00:00    NaN
2023-07-20 00:01:30   10.0
2023-07-20 01:40:00   30.0
2023-07-20 01:50:10   20.0
2023-07-20 02:00:00    NaN
2023-07-20 03:00:00   15.0
2023-07-20 04:00:00    NaN
2023-07-20 05:00:00    NaN

The row '2023-07-20 01:00:00 NaN' should not be present in the result: the sampling frequency is 1 hour, and that hour already contains two data points (2023-07-20 01:40:00 and 2023-07-20 01:50:10).

Any help or reference would be appreciated. Cheers.

Upvotes: 1

Views: 28

Answers (1)

mozway

Reputation: 262484

I would post-process your output with boolean indexing to remove the unwanted rows:

# m1: is the timestamp off the hourly grid (i.e. not one of the new_index timestamps)?
m1 = ~df_reindexed.index.isin(new_index)
# m2: is the value not NaN?
m2 = df_reindexed['value'].notna()
# m3: does the 1-hour window contain fewer than 2 non-NaN data points?
m3 = (df_reindexed.groupby(pd.Grouper(level=0, freq='1H'))
      ['value'].transform('count').lt(2)
     )

# if any condition is True, keep the row
out = df_reindexed[m1|m2|m3]

NB. I'm not sure if m1 is really needed, you can remove it if not.

Output:

                     value
2023-07-20 00:00:00    NaN
2023-07-20 00:01:30   10.0
2023-07-20 01:40:00   30.0
2023-07-20 01:50:10   20.0
2023-07-20 02:00:00    NaN
2023-07-20 03:00:00   15.0
2023-07-20 04:00:00    NaN
2023-07-20 05:00:00    NaN
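
Regarding the NB about m1: for this sample, every off-grid timestamp comes from the original data and carries a value, so m1 should never select a row that m2 does not. A quick check of that, as a sketch only (it rebuilds the question's data and uses the lowercase 'h' alias, which is my choice rather than the original '1H'):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame(
    {"value": [10, 20, 30, 15]},
    index=pd.to_datetime(["2023-07-20 00:01:30", "2023-07-20 01:50:10",
                          "2023-07-20 01:40:00", "2023-07-20 03:00:00"]),
).sort_index()

new_index = pd.date_range("2023-07-20 00:00:00", "2023-07-20 05:00:00", freq="h")
df_reindexed = df.reindex(df.index.union(new_index))

# Same three masks as above
m1 = ~df_reindexed.index.isin(new_index)
m2 = df_reindexed["value"].notna()
m3 = (df_reindexed.groupby(pd.Grouper(level=0, freq="h"))["value"]
      .transform("count").lt(2))

with_m1 = df_reindexed[m1 | m2 | m3]
without_m1 = df_reindexed[m2 | m3]

# True means m1 made no difference for this data
same = with_m1.equals(without_m1)
print(same)
```

If the original data could itself contain NaN values at off-grid timestamps, m1 would start to matter, since it is the only mask that keeps those rows in a window with 2 or more data points.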

Intermediates:

                     value  ngroup  count     m1     m2     m3  m1|m2|m3
2023-07-20 00:00:00    NaN       0      1  False  False   True      True
2023-07-20 00:01:30   10.0       0      1   True   True   True      True
2023-07-20 01:00:00    NaN       1      2  False  False  False     False
2023-07-20 01:40:00   30.0       1      2   True   True  False      True
2023-07-20 01:50:10   20.0       1      2   True   True  False      True
2023-07-20 02:00:00    NaN       2      0  False  False   True      True
2023-07-20 03:00:00   15.0       3      1  False   True   True      True
2023-07-20 04:00:00    NaN       4      0  False  False   True      True
2023-07-20 05:00:00    NaN       5      0  False  False   True      True
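
If you need this in more than one place, the steps could be wrapped in a small helper. This is only a sketch under some assumptions: the name fill_gaps and its parameters are hypothetical, it drops the m1 mask (assuming the original data has no NaN values), and it groups on the floored index instead of pd.Grouper, which should give the same windows here.

```python
import pandas as pd

def fill_gaps(df, start, end, freq="h", column="value", min_points=2):
    """Hypothetical helper: add NaN rows on a regular grid, then drop the
    NaN rows whose window already holds at least `min_points` data points."""
    grid = pd.date_range(start=start, end=end, freq=freq)
    out = df.reindex(df.index.union(grid))
    # Number of non-NaN values in each row's window
    per_window = out.groupby(out.index.floor(freq))[column].transform("count")
    # Keep real data points, plus NaN rows in sparsely populated windows
    keep = out[column].notna() | per_window.lt(min_points)
    return out[keep]

# Sample data from the question
df = pd.DataFrame(
    {"value": [10, 20, 30, 15]},
    index=pd.to_datetime(["2023-07-20 00:01:30", "2023-07-20 01:50:10",
                          "2023-07-20 01:40:00", "2023-07-20 03:00:00"]),
).sort_index()

result = fill_gaps(df, "2023-07-20 00:00:00", "2023-07-20 05:00:00")
print(result)
```

The threshold of 2 is taken from the lt(2) in m3; adjust min_points if your definition of an "already sampled" window is different.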

Upvotes: 0
