DocZerø
DocZerø

Reputation: 8557

Interpolate pandas missing datetime64 values

I'm using Pandas 0.23.4 with Python 3.7.1.

I've been gathering ICMP ping replies and want to analyse them with pandas.

The input file contains data like:

13:27:19.651425 64 bytes from 1.1.1.1: icmp_seq=26 ttl=57 time=15.643 ms
13:27:20.652464 64 bytes from 1.1.1.1: icmp_seq=27 ttl=57 time=11.957 ms
13:27:21.653371 64 bytes from 1.1.1.1: icmp_seq=28 ttl=57 time=12.201 ms

Data extracted: sequence, timestamp, ttl and time elapsed.

However, occasionally there's a timeout, which looks like this:

Request timeout for icmp_seq 6478

In this case, the only piece of data I can parse is the sequence.

My dataframe's dtypes looks like this:

elapsed              object
timestamp    datetime64[ns]
ttl                  object
dtype: object

The index is a Int64Index (the sequence column).

Timeouts that have been recorded will contain NA (NAT for the timestamp). What I'd like to do is interpolate the value for the timestamp column, as I have a value before and after the timeout.

However, if I try:

df_ping.timestamp.interpolate(method='linear')

It still returns

5060   2018-12-11 14:51:28.704059
5061                          NaT
5062                          NaT
5063                          NaT
5064                          NaT
5065                          NaT
5066                          NaT
5067                          NaT
5068                          NaT
5068   2018-12-11 14:51:38.255034
5069   2018-12-11 14:51:38.255073

Another example:

4673   2018-12-11 14:45:00.769315
4674                          NaT
4675   2018-12-11 14:45:02.449024

Any reason why the interpolation doesn't seem to do anything?

Upvotes: 1

Views: 1142

Answers (1)

ALollz
ALollz

Reputation: 59579

The interpolation methods expect real numbers, not datetime objects. You need to convert your numbers to floats. Do this by subtracting off the smallest timestamp so that you have enough precision to store your numbers. Interpolate and add the offset back.

import pandas as pd

t0 = df.timestamp.min()
m = df.timestamp.notnull()
df.loc[m, 't_int'] = (df.loc[m, 'timestamp'] - t0).dt.total_seconds()

df['timestamp'] = t0 + pd.to_timedelta(df.t_int.interpolate(), unit='s')

Output:

                         timestamp     t_int
5060 2018-12-11 14:51:28.704059000  0.000000
5061 2018-12-11 14:51:29.765278444       NaN
5062 2018-12-11 14:51:30.826497889       NaN
5063 2018-12-11 14:51:31.887717333       NaN
5064 2018-12-11 14:51:32.948936778       NaN
5065 2018-12-11 14:51:34.010156222       NaN
5066 2018-12-11 14:51:35.071375667       NaN
5067 2018-12-11 14:51:36.132595111       NaN
5068 2018-12-11 14:51:37.193814556       NaN
5068 2018-12-11 14:51:38.255034000  9.550975
5069 2018-12-11 14:51:38.255073000  9.551014

Upvotes: 3

Related Questions