Reputation: 363
I am looking fopr a way to interpolate only over short gaps in a Pandas DataFrame that has a DateTimeIndex. Long gaps should be kept as they are.
df = pd.DataFrame(
{ "value": [ 1, np.nan, 3, np.nan, np.nan, 5, np.nan, 11, np.nan, 21, np.nan, 41 ] },
index=pd.to_datetime( [
"2021-01-01 00:00", "2021-01-01 00:05", "2021-01-01 00:10",
"2021-01-01 00:11", "2021-01-01 00:13", "2021-01-01 00:14",
"2021-01-01 00:15", "2021-01-01 01:30", "2021-01-01 03:00",
"2021-01-01 04:00", "2021-01-01 05:45", "2021-01-01 06:45",
] )
)
value
2021-01-01 00:00:00 1.0
2021-01-01 00:05:00 NaN
2021-01-01 00:10:00 3.0
2021-01-01 00:11:00 NaN
2021-01-01 00:13:00 NaN
2021-01-01 00:14:00 5.0
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.0
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.0
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.0
The idea is to keep gaps that are longer than a certain time (>5 minutes in this case), but interpolate all missing values within shorter gaps.
interpolate()
has a limit
argument that limits the number of missing values to be interpolated, but this does not respect the time delta between the rows, only the number of rows.
I would like the result to be like this:
value
2021-01-01 00:00:00 1.000000
2021-01-01 00:05:00 2.000000
2021-01-01 00:10:00 3.000000
2021-01-01 00:11:00 3.500000
2021-01-01 00:13:00 4.500000
2021-01-01 00:14:00 5.000000
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.000000
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.000000
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.000000
Upvotes: 1
Views: 1115
Reputation: 3720
This solution fills value gaps that are in time spans that are less than a specified value. The filled values are set proportionally to the entry's position within the value gap's time span (time-interpolated values). Julian dates are used for easier computation.
Set max time span gap to fill with time-interpolated values. 5 minutes.
jd_max_gap_fill = 5/(60*24)
Calculate the value gap:
df['ffill'] = df['value'].ffill()
df['value_gap'] = df['value'].bfill() - df['value'].ffill()
Get the Julian date for the entry:
df['jd'] = df.index.to_julian_date()
Calculate the time gap:
df['jd_nan'] = np.where(~df['value'].isna(), df['jd'], np.nan)
df['jd_gap'] = df['jd_nan'].bfill() - df['jd_nan'].ffill()
Time-wise, calculate how far into the value gap we are:
df['jd_start'] = df['jd_nan'].ffill()
df['jd_prp'] = np.where(df['jd_gap'] != 0, (df['jd'] - df['jd_start'])/df['jd_gap'], 0)
Calculate time-interpolated values:
df['filled_value'] = np.where(df['jd_gap'] <= jd_max_gap_fill, df['ffill'] + df['value_gap'] * df['jd_prp'], np.nan)
df['filled_value']
2021-01-01 00:00:00 1.0
2021-01-01 00:05:00 NaN
2021-01-01 00:10:00 3.0
2021-01-01 00:11:00 3.5
2021-01-01 00:13:00 4.5
2021-01-01 00:14:00 5.0
2021-01-01 00:15:00 NaN
2021-01-01 01:30:00 11.0
2021-01-01 03:00:00 NaN
2021-01-01 04:00:00 21.0
2021-01-01 05:45:00 NaN
2021-01-01 06:45:00 41.0
Note that my output is different than your expected output because the first NaN
is in a 10 minute gap.
Upvotes: 1