Jezzamon

Reputation: 1491

Interpolate single value from time series

I have a relatively large (~300 MB) set of geolocation data, where the format is

Timestamp, id, type, X, Y

With the following data types:

In[7]: df.dtypes
Out[7]: 
Timestamp    datetime64[ns]
id                    int64
type                 object
X                     int64
Y                     int64
dtype: object

Each id corresponds to a particular user, and each person has several hundred points recorded across the day.

I want to create a plot showing where everyone is at a certain second, so I need one point for every id. However, the data is somewhat sparse, and it's unlikely that any data point falls exactly on that second. I want to approximate the position by interpolating between the two closest points.

Between data points, I'm assuming people move linearly, so that if we know the location at 8:31:10 and 8:31:50, then at 8:31:30 they should be exactly halfway between the two locations, and at 8:31:11 they should be 1/40th of the way between the points (so interpolating as described here: Pandas data frame: resample with linear interpolation)
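To make the arithmetic concrete, here is a small sketch of what I mean for one person. The date, the coordinates and the function name `lerp_position` are made up for illustration; only the 8:31:10 / 8:31:50 times come from the example above.

```python
from datetime import datetime

def lerp_position(t, t0, p0, t1, p1):
    """Linearly interpolate an (x, y) position at time t between two fixes."""
    # Fraction of the way from t0 to t1 (0.5 at 8:31:30, 1/40 at 8:31:11)
    frac = (t - t0).total_seconds() / (t1 - t0).total_seconds()
    return (p0[0] + frac * (p1[0] - p0[0]),
            p0[1] + frac * (p1[1] - p0[1]))

# Hypothetical fixes: the times match the example, the positions are invented
t0 = datetime(2015, 1, 1, 8, 31, 10)
t1 = datetime(2015, 1, 1, 8, 31, 50)
p0, p1 = (0, 0), (400, 80)

halfway = lerp_position(datetime(2015, 1, 1, 8, 31, 30), t0, p0, t1, p1)
one_fortieth = lerp_position(datetime(2015, 1, 1, 8, 31, 11), t0, p0, t1, p1)
```

With these numbers, `halfway` lands midway between the two positions and `one_fortieth` is 1/40th of the way along the segment.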

I'm thinking the basic process would be:

I know I can loop through each id with

for name, group in df.groupby('id'):

and plotting isn't a problem, but I'm not sure about the rest.

After a bit of searching I haven't found any good way to do this for a single value from each group. Other answers suggest using the resample and interpolate functions, but that will take way too long with the size of data I have, and does a lot of unnecessary calculations seeing as I only need one point.

Upvotes: 1

Views: 554

Answers (1)

Severin Pappadeux

Reputation: 20080

It is not quite clear what you want, but let's start with something.

First, you probably need a list of unique IDs, right?

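import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = ...

unids = df['id'].unique()

for uid in unids:
    # subset df by id and sort the resulting dataframe by Timestamp
    df_id = df[df['id'] == uid].sort_values('Timestamp')
    tmin = df_id['Timestamp'].iloc[0]
    tmax = df_id['Timestamp'].iloc[-1]
    tstep = ... # time step, e.g. a pandas frequency string

    # known and wanted times, both as seconds since tmin
    t_known = (df_id['Timestamp'] - tmin).dt.total_seconds()
    t_query = (pd.date_range(tmin, tmax, freq=tstep) - tmin).total_seconds()

    # interpolate each coordinate linearly
    xs = np.interp(t_query, t_known, df_id['X'])
    ys = np.interp(t_query, t_known, df_id['Y'])

    # plot the interpolated positions for this id
    plt.plot(xs, ys)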

Does this look reasonable?
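If you only need one instant rather than a whole time grid, the inner loop can shrink to a single interpolation per id. Below is a rough sketch of that idea, not a tuned solution: the function name `positions_at` and the tiny `demo` frame are made up, and the column names are assumed to match your dtypes listing. Note that `np.interp` does not extrapolate, so an id whose records all fall before or after the requested time gets its first or last known position.

```python
import numpy as np
import pandas as pd

def positions_at(df, when):
    """Return one interpolated (X, Y) row per id at the instant `when`."""
    when = pd.Timestamp(when)
    rows = []
    for uid, group in df.groupby('id'):
        group = group.sort_values('Timestamp')
        # Seconds relative to the query time, so the query itself sits at 0.0
        secs = (group['Timestamp'] - when).dt.total_seconds().to_numpy()
        # Linear interpolation between the two fixes that bracket 0.0;
        # outside the recorded range np.interp clamps to the end values
        x = np.interp(0.0, secs, group['X'].to_numpy())
        y = np.interp(0.0, secs, group['Y'].to_numpy())
        rows.append({'id': uid, 'X': x, 'Y': y})
    return pd.DataFrame(rows)

# Tiny invented frame, only to show the call
demo = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2015-01-01 08:31:10', '2015-01-01 08:31:50',
                                 '2015-01-01 08:30:00', '2015-01-01 08:32:00']),
    'id': [1, 1, 2, 2],
    'type': ['a', 'a', 'b', 'b'],
    'X': [0, 400, 100, 100],
    'Y': [0, 80, 0, 120],
})
snapshot = positions_at(demo, '2015-01-01 08:31:30')
```

The result has one row per id, which can go straight into a scatter plot. This avoids resampling the whole series, since each group does one sort and two `np.interp` calls.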

Upvotes: 1
