Oam

Reputation: 345

Order of spline interpolation for pandas dataframe

I have the following dataframe which shows data from Motion Capture, where each column is a marker (i.e. position data) and rows are time:

        LTHMB X  RTHMB X
0       932.109  872.921
1       934.605  873.798
2       932.383  873.998
3       940.946  875.609
4       941.549  875.875
...         ...      ...
14765       NaN  602.700
14766   562.350      NaN
14767   562.394      NaN
14768   562.421      NaN
14769   562.490  602.705

In the data, there are some NaN values that I need to fill. I'm not really an expert in this so I'm not sure what is the best way to fill these.

I know I can do forward/backward fill, and I also read about spline interpolation, which seems more sophisticated. In the documentation for pandas.DataFrame.interpolate it states that for spline you have to specify the order.

What would I use for the order in this case? Each marker has an X, Y and Z. Does that mean I'd use a cubic spline, or is it not that simple?

Upvotes: 2

Views: 8265

Answers (1)

Akshay Sehgal

Reputation: 19322

The order of the spline has nothing to do with the number of features (X, Y, Z) in your dataset. Each column is interpolated independently of the others. Before applying an algorithm, it is important to understand how it works and what each of its parameters (such as 'order') controls.
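To illustrate the independence of columns, here is a minimal sketch. The data is synthetic and only borrows the column names from the question; `method="spline"` needs SciPy installed. Filling the whole frame gives the same result for a column as filling that column on its own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two marker columns (values are made up).
t = np.arange(20, dtype=float)
df = pd.DataFrame({
    "LTHMB X": 900 + 0.5 * t**2,
    "RTHMB X": 870 + 2.0 * t,
})

# Knock out a few interior samples in each column.
df.loc[5:7, "LTHMB X"] = np.nan
df.loc[12:13, "RTHMB X"] = np.nan

# Interpolate the whole frame, then one column in isolation.
both = df.interpolate(method="spline", order=3)
alone = df["LTHMB X"].interpolate(method="spline", order=3)

# The other column had no influence on the result.
print(np.allclose(both["LTHMB X"], alone))
```

So `order=3` is a choice about the shape of the curve through time, not about the three spatial coordinates.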

For intuition, a cubic (order = 3) spline is a curve built from "piecewise" polynomials of degree three, one per interval between consecutive data points.


Note that each polynomial is valid only within its own interval; together they make up the interpolation function. While extrapolation predicts values outside the range of the data, interpolation works only within the data boundaries.

The "order" of the spline is the order of these "piecewise" polynomials.

[Image] Source: Google

As you can see, a linear spline (order=1) fits degree-one polynomials (straight lines) between the data points, while a 7th-order spline fits 7th-degree polynomials.
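A small sketch of what the order changes, using SciPy's `make_interp_spline` directly rather than pandas. The sine curve and the nine sample points are arbitrary choices for illustration, not motion-capture data.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Nine samples of a smooth curve.
x = np.linspace(0, 2 * np.pi, 9)
y = np.sin(x)

# Dense grid inside the sampled range (interpolation, not extrapolation).
x_new = np.linspace(0, 2 * np.pi, 101)

# k is the degree of the piecewise polynomials: 1 = straight lines, 3 = cubic.
fits = {k: make_interp_spline(x, y, k=k)(x_new) for k in (1, 3)}

# Worst-case deviation from the true curve for each order.
errors = {k: float(np.max(np.abs(fits[k] - np.sin(x_new)))) for k in fits}
print(errors)
```

On a smooth signal like this one the cubic pieces follow the curve more closely than straight lines do, but that will not necessarily hold for noisy or jumpy data.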


Which should you use?

No one can simply tell you which would be a better fit. You will have to visualize it to see if a specific interpolation technique is able to give you a relevant imputation or not.

A more reliable way to choose is to compare candidate methods quantitatively, for example with `r2_score`, on data where the true values are known. You can do the following:

  1. Take a complete sequence from your data (no missing values)
  2. Randomly set a percentage of this data as missing (keep the hidden values separately)
  3. Try multiple interpolation methods to complete the sequence (use order 3, 5, 7 splines etc.)
  4. Take the predicted sequence and compare it to the actual sequence using `r2_score`.
  5. The one with the highest `r2_score` is the one that fits your data best
  6. Repeat this multiple times, at multiple percentages of injected missing data, to form a valid study of which one is better than the others in general.
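The steps above can be sketched roughly as follows. Everything here is an assumption for illustration: the signal is synthetic, `r2` and `score_method` are hypothetical helpers (`r2` computes the same quantity as `sklearn.metrics.r2_score`), and the candidate list and the 10% missing rate are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Step 1: a complete sequence (synthetic stand-in for one marker column).
t = np.arange(500)
complete = pd.Series(900 + 40 * np.sin(t / 25.0) + rng.normal(0, 0.5, t.size))

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def score_method(series, frac, rng, **interp_kwargs):
    """Hide a fraction of interior points, fill them, score the fill."""
    # Step 2: hide interior points only, so every gap has data on both sides.
    interior = np.arange(1, len(series) - 1)
    hidden = rng.choice(interior, size=int(frac * len(interior)), replace=False)
    masked = series.copy()
    masked.iloc[hidden] = np.nan
    # Step 3: fill with the candidate method.
    filled = masked.interpolate(**interp_kwargs)
    # Step 4: compare predictions with the values that were hidden.
    return r2(series.iloc[hidden].to_numpy(), filled.iloc[hidden].to_numpy())

candidates = {
    "linear": dict(method="linear"),
    "spline order 3": dict(method="spline", order=3),
    "spline order 5": dict(method="spline", order=5),
}

# Steps 5 and 6: average over several random masks, then compare.
results = {
    name: float(np.mean([score_method(complete, 0.10, rng, **kw) for _ in range(5)]))
    for name, kw in candidates.items()
}
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.5f}")
```

Real gaps in motion-capture data often come in runs rather than as isolated points, so hiding contiguous blocks instead of random samples may give a more honest comparison.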

You can find this approach implemented roughly here


Upvotes: 6
