How to improve the performance when 2d interpolating/smoothing lines using scipy?

I have a moderate size data set, namely 20000 x 2 floats in a two column matrix. The first column is the the x column which represents the distance to the original point along a trajectory, another column is the y column which represents the work has done to the object. This data set is obtained from lab operations, so it's fairly arbitrary. I've already turned this structure into numpy array. I want to plot y vs x in a figure with a smooth curve. So I hope the following code could help me:

x_smooth = np.linspace(x.min(),x.max(), 20000)
y_smooth = spline(x, y, x_smooth)
plt.plot(x_smooth, y_smooth)
plt.show()

However, when my program execute the line y_smooth = spline(x,y,x_smooth), it takes a very long time,say 10 min, and even sometimes it will blow my memory that I have to restart my machine. I tried to reduce the chunk number to 200 and 2000 and none of them works. Then I checked the official scipy reference: scipy.interpolate.spline here. And they said that spline is deprecated in v 0.19, but I'm not using the new version. If spline is deprecated for quite a bit of the time, how to use the equivalent Bspline now? If spline is still functioning, then what causes the slow performance

One portion of my data could look like this:

13.202      0.0
13.234738      -0.051354643759
12.999116      0.144464320836
12.86252      0.07396528119
13.1157      0.10019738758
13.357109      -0.30288563381
13.234004      -0.045792536285
12.836279      0.0362257166275
12.851597      0.0542649286915
13.110691      0.105297378401
13.220619      -0.0182963209185
13.092143      0.116647353635
12.545676      -0.641112204849
12.728248      -0.147460703493
12.874176      0.0755861585235
12.746764      -0.111583725833
13.024995      0.148079528382
13.106033      0.119481137144
13.327233      -0.197666132456
13.142423      0.0901867159545

Upvotes: 2

Answers (3)

RandomEli

Reputation: 1557

My problem can be generalize to how to smoothly plot 2d graphs when data points are randomized. Since you are only dealing with two columns of data, if you sort your data by independent variable, at least your data points will be connected in order, and that's how matplotlib connects your data points.

@Dawid Laszuk has provided one solution to sort data by independent variable, and I'll display mine here:

plotting_columns = []
    for i in range(len(x)):
        plotting_columns.append(np.array([x[i],y[i]]))
    plotting_columns.sort(key=lambda pair : pair[0])
    plotting_columns = np.array(plotting_columns)

traditional sort() by filter condition could also do the sorting job efficient here.

But it's just your first step. The following steps are not hard either, to smooth your graph, you also want to keep your independent variable in linear ascending order with identical step interval, so

x_smooth = np.linspace(x.min(), x.max(), num_steps)

is enough to do the job. Usually, if you have plenty of data points, for example, more than 10000 points (correctness and accuracy are not human verifiable), you just want to plot the significant points to display the trend, then only smoothing x is enough. So you can plt.plot(x_smooth,y) simply.

You will notice that x_smooth will generate many x values that will not have corresponding y value. When you want to maintain the correctness, you need to use line fitting functions. As @ev-br demonstrated in his answer, spline functions are expensive on purpose. Therefore you might want to do some simpler trick. I smoothed my graph without using those functions. And you have some simple steps to it.

First, round your values so that your data will not vary too much in small intervals. (You can skip this step) You can change one line when you constructing the plotting_columns as:

plotting_columns.append(np.around(np.array(x[i],y[i]), decimal=4))

After done this, you can filter out the point that you don't want to plot by choosing the points close to the x_smooth values:

new_plots = []
for i in range(len(x_smooth)):
    if plotting_columns[:,0][i] >= x_smooth[i] - error and plotting_columns[:,0][i]< x_smooth[i] + error:
        new_plots.append(plotting_columns[i])
    else:
        # Remove all points between the interval #

This is how I solved my problems.

Upvotes: 0

ev-br

Reputation: 26090

Several issues here. First and foremost, spline fitting you're trying to use is global. This means that you're solving a system of linear equations of the size 20000 at the construction time (evaluations are weakly sensitive to the dataset size though). This explains why the spline construction is slow.

scipy.interpolate.spline, furthermore, does linear algebra with full matrices --- hence memory consumption. This is precisely why it's deprecated from scipy 0.19.0 on.

The recommended replacement, available in scipy 0.19.0, is the BSpline/ make_interp_spline combo:

>>> spl = make_interp_spline(x, y, k=3)    # returns a BSpline object
>>> y_new = spl(x_new)                     # evaluate

Notice it is not BSpline(x, y, k): BSpline objects do not know anything about the data or fitting or interpolation.

If you are using older scipy versions, your options are:

CubicSpline(x, y) for cubic splines
splrep(x, y, s=0) / splev combo.

However, you may want to think if you really need twice continuously differentiable functions. If only once differentiable functions are smooth enough for your purposes, then you can use local spline interpolations, e.g. Akima1DInterpolator or PchipInterpolator:

In [1]: import numpy as np

In [2]: from scipy.interpolate import pchip, splmake

In [3]: x = np.arange(1000)

In [4]: y = x**2

In [5]: %timeit pchip(x, y)
10 loops, best of 3: 58.9 ms per loop

In [6]: %timeit splmake(x, y)    
1 loop, best of 3: 5.01 s per loop

Here splmake is what spline uses under the hood, and it's also deprecated.

Upvotes: 3

Dawid Laszuk

Reputation: 1978

Most interpolation methods in SciPy are function-generating, i.e. they return function which you can then execute on your x data. For example, using CubicSpline method, which connects all points with pointwise cubic spline would be

from scipy.interpolate import CubicSpline

spline = CubicSpline(x, y)
y_smooth = spline(x_smooth)

Based on your description I think that you correctly want to use BSpline. To do so, follow the pattern above, i.e.

from scipy.interpolate import BSpline

order = 2 # smoothness order
spline = BSpline(x, y, order)
y_smooth = spline(x_smooth)

Since you have such amount of data, it probably must be very noisy. I'd suggest using bigger spline order, which relates to the number of knots used for interpolation.

In both cases, your knots, i.e. x and y, should be sorted. These are 1D interpolation (since you are using only x_smooth as input). You can sort them using np.argsort. In short:

from scipy.interpolate import BSpline

sort_idx = np.argsort(x)
x_sorted = x[sort_idx]
y_sorted = y[sort_idx]

order = 20 # smoothness order
spline = BSpline(x_sorted, y_sorted, order)
y_smooth = spline(x_smooth)

plt.plot(x_sorted, y_sorted, '.')
plt.plot(x_smooth, y_smooth, '-')
plt.show()

Upvotes: 1

How to improve the performance when 2d interpolating/smoothing lines using scipy?

Answers (3)

Related Questions