Reputation: 148
Good afternoon,
I'm working on pre-processing data that's streaming from sensors and which normally comes in every second (1hz). However this is not always the case, there are instances when there is gaps of data of 2s, 3s and even more.
I'm trying to set up some code that fills these gaps but only when they're smaller than some number, say 10 seconds.
The data comes in as follows:
Timestamp Sensor1 Sensor2 Sensor3
7/1/2020 00:00:00 5 135 77
7/1/2020 00:00:01 6 118 79
7/1/2020 00:00:02 4 131 75
7/1/2020 00:00:04 3 125 78
7/1/2020 00:00:05 9 145 67
7/1/2020 00:00:06 6 136 71
7/1/2020 00:00:10 7 141 77
7/1/2020 00:00:11 4 145 72
What I'd like to do is fill the dataframe whenever the missed window is smaller than 10 seconds, and fill it with the average of the two neighboring values.
Timestamp Sensor1 Sensor2 Sensor3
7/1/2020 00:00:00 5 135 77
7/1/2020 00:00:01 6 118 79
7/1/2020 00:00:02 4 131 75
7/1/2020 00:00:03 3.5 128 76.5
7/1/2020 00:00:04 3 125 78
7/1/2020 00:00:05 9 145 67
7/1/2020 00:00:06 6 136 71
7/1/2020 00:00:07 6.5 138.5 74
7/1/2020 00:00:08 6.5 138.5 74
7/1/2020 00:00:09 6.5 138.5 74
7/1/2020 00:00:10 7 141 77
7/1/2020 00:00:11 4 145 72
I think once I can set up the proper time "grid" with no missing seconds filling it should be relatively simple using the fill method. But how do I tell it to only fill windows smaller than 10 seconds?
Thanks in advance
Upvotes: 1
Views: 343
Reputation: 153460
Try:
#Make sure Timestamp is datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
#Create DatetimeIndex by moving Timestamp into index
df = df.set_index('Timestamp')
#Resample dataframe with freq seconds
dfs = df.resample('S')
#Calculate average using forward fill and bfill get last and next non-NaN values
df_out = (dfs.ffill()+dfs.bfill()) / 2
df_out
Output:
Sensor1 Sensor2 Sensor3
Timestamp
2020-07-01 00:00:00 5.0 135.0 77.0
2020-07-01 00:00:01 6.0 118.0 79.0
2020-07-01 00:00:02 4.0 131.0 75.0
2020-07-01 00:00:03 3.5 128.0 76.5
2020-07-01 00:00:04 3.0 125.0 78.0
2020-07-01 00:00:05 9.0 145.0 67.0
2020-07-01 00:00:06 6.0 136.0 71.0
2020-07-01 00:00:07 6.5 138.5 74.0
2020-07-01 00:00:08 6.5 138.5 74.0
2020-07-01 00:00:09 6.5 138.5 74.0
2020-07-01 00:00:10 7.0 141.0 77.0
2020-07-01 00:00:11 4.0 145.0 72.0
Upvotes: 0
Reputation: 29635
without the 10s missing window, it is something with resample
and interpolate
.
df.set_index('Timestamp').resample('s').interpolate().reset_index()
To add the filling only when less than 10s missing, then you can use groupby
and get a new group where the diff
between 2 rows are less than 10 in seconds. Note: to see it, I changed in your data 10 by 20 and 11 by 22 in the last two timestamps.
print (df.set_index('Timestamp')
.groupby(df['Timestamp'].diff().dt.total_seconds()
.gt(10).cumsum()
.to_numpy())
.apply(lambda x: x.resample('s').interpolate())
.reset_index()
.drop('level_0', axis=1)
)
Timestamp Sensor1 Sensor2 Sensor3
0 2020-07-01 00:00:00 5.0 135.0 77.0
1 2020-07-01 00:00:01 6.0 118.0 79.0
2 2020-07-01 00:00:02 4.0 131.0 75.0
3 2020-07-01 00:00:03 3.5 128.0 76.5
4 2020-07-01 00:00:04 3.0 125.0 78.0
5 2020-07-01 00:00:05 9.0 145.0 67.0
6 2020-07-01 00:00:06 6.0 136.0 71.0
7 2020-07-01 00:00:20 7.0 141.0 77.0
8 2020-07-01 00:00:21 5.5 143.0 74.5
9 2020-07-01 00:00:22 4.0 145.0 72.0
Upvotes: 1