CaramelFix
CaramelFix

Reputation: 83

Resampling and filling missing data in pandas

I have a raw dataset that looks like this:

df = pd.DataFrame({'speed': [66.8,67,67.1,70,69],
                   'time': ['2017-08-09T05:41:30.168Z', '2017-08-09T05:41:31.136Z', '2017-08-09T05:41:31.386Z', '2017-08-09T05:41:31.103Z','2017-08-09T05:41:35.563Z' ]})

I could do some processing on it to make it look like (removed microseconds):

df['time']= pd.to_datetime(df.time)
df['time'] = df['time'].apply(lambda x: x.replace(microsecond=0))

>>> df
   speed                time
0   66.8 2017-08-09 05:41:30
1   67.0 2017-08-09 05:41:31
2   67.1 2017-08-09 05:41:31
3   70.0 2017-08-09 05:41:31
4   69.0 2017-08-09 05:41:35

I need to now resample the data so that any entries that arrived at the same timestamp are averaged together, and for the timestamps that did not receive any data, use the last available value. Like:

   speed                time
0   66.80 2017-08-09 05:41:30
1   68.03 2017-08-09 05:41:31
2   70.00 2017-08-09 05:41:32
3   70.00 2017-08-09 05:41:33
4   70.00 2017-08-09 05:41:34
5   69.00 2017-08-09 05:41:35

I understand this might involve the use of groupby and resample, but being a beginner I find myself struggling with these. Any ideas on how to proceed?

I have tried this but I am getting wrong results:

df.groupby( [df["time"].dt.second]).mean()
          speed
time           
30    66.800000
31    68.033333
35    69.000000

Upvotes: 8

Views: 6124

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210952

In [279]: df.resample('1S', on='time').mean().ffill()
Out[279]:
                         speed
time
2017-08-09 05:41:30  66.800000
2017-08-09 05:41:31  68.033333
2017-08-09 05:41:32  68.033333
2017-08-09 05:41:33  68.033333
2017-08-09 05:41:34  68.033333
2017-08-09 05:41:35  69.000000

Upvotes: 11

Related Questions