Pandas resample only when makes sense

Question

I have a time series that is very irregular. The difference in time, between two records can be 1s or 10 days.

I want to resample the data every 1h, but only when the sequential records are less than 1h.

How to approach this, without making too many loops?

In the example above, I would like to resample only rows 5-6 (delta difference is 10s) and rows 6-7 (delta difference is 50min). The others should remain as they are.

tmp=vals[['datumtijd','filter data']]

     datumtijd           filter data
0   1970-11-01 00:00:00        129.0
1   1970-12-01 00:00:00        143.0
2   1971-01-05 00:00:00        151.0
3   1971-02-01 00:00:00        151.0
4   1971-03-01 00:00:00        163.0
5   1971-03-01 00:00:10        163.0
6   1971-03-01 00:00:20        163.0
7   1971-03-01 00:01:10        163.0
8   1971-03-01 00:04:10        163.0
..         ...          ...
244 1981-08-19 00:00:00        102.0
245 1981-09-02 00:00:00         98.0
246 1981-09-17 00:00:00         92.0
247 1981-10-01 00:00:00         89.0
248 1981-10-19 00:00:00         92.0

Tom · Accepted Answer

You can be a little explicit about this by using groupby on the hour-floor of the time stamps:

grouped = df.groupby(df['datumtijd'].dt.floor('1H')).mean()

This is explicitly looking for the hour of each existing data point and grouping the matching ones.

But you can also just do the resample and then filter out the empty data, as pandas can still do this pretty quickly:

resampled = df.resample('1H', on='datumtijd').mean().dropna()

In either case, you get the following (note that I changed the last time stamp just so that the console would show the hours):

                     filter data
datumtijd                       
1970-11-01 00:00:00        129.0
1970-12-01 00:00:00        143.0
1971-01-05 00:00:00        151.0
1971-02-01 00:00:00        151.0
1971-03-01 00:00:00        163.0
1981-08-19 00:00:00        102.0
1981-09-02 00:00:00         98.0
1981-09-17 00:00:00         92.0
1981-10-01 00:00:00         89.0
1981-10-19 03:00:00         92.0

One quick clarification also. In your example, rows 5-8 all occur within the same hour, so they all get grouped together (hour:minute:second)!.

Also, see this related post.

Pandas resample only when makes sense

Answers (1)

Related Questions