Reputation: 1214

How can I fit two linear functions to a set of data points?

I have a set of data points, which appear sort of like a line with a curve near the beginning. See the image below, which shows the points with a line of best fit (fit to the whole data set).

Instead, they could be described by two linear functions (a line through the leftmost set of points and a separate line through the rest of the data points). What these points actually correspond to are neutron decay, which contains two different isotopes.

I don't know which points correspond to which isotope, so I need to somehow make a best guess. The curve for one isotope will be a straight line, and the curve for the other will be a different straight line. How can I fit two different lines of best fit (linear) to the set of data points, such that the fitting for both is optimized?

One idea I had is to pick a "cutouff point", say at t=100 (x-axis), and fit the points on the left to a line, and the points to the right to a different line. I could then compute chi^2 for both lines in order to get the "goodness" of the fits. Then, I could just keep doing the same thing many times with slightly different cutoff points, until I find the pair of lines that gives the best overall fits.

A different idea, which seems more involved, is to describe these data points as a combination of the two lines, y= m1*t + m2*t + b1 + b2, where the ms are the slopes and bs are the y-intercepts. Then, taking the derivative of the total curve, I would have dy/dt = m1+m2. Then perhaps I could cycle through different "cutouff points", and fit to lines until I get a combination where the derivative is closest to m1+m2. But I'm not sure how to do this, since I'm not working with one curve initially, just a bunch of discrete data points.

What would be the best way to go about this optimization of two fits in Python?

Upvotes: 6

Answers (3)

James Phillips

Reputation: 4647

Here is an example of fitting two straight lines to one data set, with the crossover point between the two lines also fitted as a parameter. This example uses scipy's Differential Evolution (DE) genetic algorithm to determine initial parameter estimates. The scipy implementation of DE uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, and this algorithm requires bounds within which to search - in this example those bounds are taken from the data maximum and minimum values.

import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings

xData = numpy.array([19.1647, 18.0189, 16.9550, 15.7683, 14.7044, 13.6269, 12.6040, 11.4309, 10.2987, 9.23465, 8.18440, 7.89789, 7.62498, 7.36571, 7.01106, 6.71094, 6.46548, 6.27436, 6.16543, 6.05569, 5.91904, 5.78247, 5.53661, 4.85425, 4.29468, 3.74888, 3.16206, 2.58882, 1.93371, 1.52426, 1.14211, 0.719035, 0.377708, 0.0226971, -0.223181, -0.537231, -0.878491, -1.27484, -1.45266, -1.57583, -1.61717])
yData = numpy.array([0.644557, 0.641059, 0.637555, 0.634059, 0.634135, 0.631825, 0.631899, 0.627209, 0.622516, 0.617818, 0.616103, 0.613736, 0.610175, 0.606613, 0.605445, 0.603676, 0.604887, 0.600127, 0.604909, 0.588207, 0.581056, 0.576292, 0.566761, 0.555472, 0.545367, 0.538842, 0.529336, 0.518635, 0.506747, 0.499018, 0.491885, 0.484754, 0.475230, 0.464514, 0.454387, 0.444861, 0.437128, 0.415076, 0.401363, 0.390034, 0.378698])


def func(xArray, breakpoint, slopeA, offsetA, slopeB, offsetB):
    returnArray = []
    for x in xArray:
        if x < breakpoint:
            returnArray.append(slopeA * x + offsetA)
        else:
            returnArray.append(slopeB * x + offsetB)
    return returnArray


# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)


def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    maxY = max(yData)
    minY = min(yData)
    slope = 10.0 * (maxY - minY) / (maxX - minX) # times 10 for safety margin

    parameterBounds = []
    parameterBounds.append([minX, maxX]) # search bounds for breakpoint
    parameterBounds.append([-slope, slope]) # search bounds for slopeA
    parameterBounds.append([minY, maxY]) # search bounds for offsetA
    parameterBounds.append([-slope, slope]) # search bounds for slopeB
    parameterBounds.append([minY, maxY]) # search bounds for offsetB


    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x

# by default, differential_evolution completes by calling curve_fit() using parameter bounds
geneticParameters = generate_Initial_Parameters()

# call curve_fit without passing bounds from genetic algorithm
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Parameters:', fittedParameters)
print()

modelPredictions = func(xData, *fittedParameters) 

absError = modelPredictions - yData

SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))

print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)

print()


##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData,  'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

Upvotes: 1

Felix

Reputation: 1905

This can be interpreted as a time-series segmentation problem in combination with linear regression. There are multiple approaches to tackle this problem. One of these you already mentioned: a manually selected point where to segment the data, another one is trying to minimize the error.

First I try to recreate the data:

import numpy as np; import matplotlib.pyplot as plt
y1 = np.linspace(5.5, 3.7, num=100)
y1 = y1 + np.random.rand(y1.shape[0]) * np.linspace(0, .3, num=y1.shape[0])
y2 = np.linspace(3.7, 1.1, num=500)
y2 = y2 + np.random.rand(y2.shape[0]) * np.linspace(0.3, 1.9, num=y2.shape[0])
y = np.append(y1, y2)
x = np.array(range(len(y)))

Then I do two linear fits using numpy.linalg.lstsq which itself is based on the method of least squares:

x1 = x[:100]
y1 = y[:100]
A1 = np.vstack([x1, np.ones(len(x1))]).T
m1, c1 = np.linalg.lstsq(A1, y1, rcond=None)[0]

x2 = x[100:]
y2 = y[100:]
A2 = np.vstack([x2, np.ones(len(x2))]).T
m2, c2 = np.linalg.lstsq(A2, y2, rcond=None)[0]

Plotting this results in the following image:

plt.scatter(x, y)
plt.plot(x1, m1 * x1 + c1, 'r')
plt.plot(x2, m2 * x2 + c2, 'r')
plt.show()

You now could use an automatic segmentation algorithm like SWAB to replace the [100:] and [:100] slices, but I'd advise against it, if you are able to manually decide, at which point to split the data. Have a look at the link provided, if you're looking for an implementation.

Upvotes: 6

Aayush Mahajan

Reputation: 4033

Formally, the second approach you mentioned is essentially trying to fit your data onto a conic section. Essentially, a pair of straight lines (POSL) is a type of conic section, and hence can be represented by the general conic equation ax^2 + by^2 + 2hxy + 2gx + 2fy + c = 0 (instead of the one you mentioned in your question, which would just end up being a single straight line of slope m1+m2). To clarify, there certainly will be an equation of the form above, that when plotted gives a POSL that fits your data. Whether your algorithm can converge to it is a different story.

You can try to use methods to find the coefficients a,b,h,g,f & c. Ideally, the coefficients you would get for the conic section would form a POSL, which would fit your dataset closely.

If you decide to implement this, you will have to keep in mind that this general equation can represent a lot of shapes like a parabola, hyperbola, etc. It might be possible that after training you find that the regression gets stuck, or doesn't converge and takes a different shape, like that of a parabola. You can try to nudge regression towards a POSL shape by rewarding adherence to these conditions nescessary for a conic section to be a POSL in the regression approach. This might overcomplicate things though.

This method is very neat mathematically, but I'd wager that trying to get your trained model to converge to a POSL would be equivalent to balancing something on the edge of a knife (condition for POSL is very narrow, essentially). Most probably, it would end up giving you an equation of a parabola, ellipse or hyperbola (which might just end up fitting your dataset sufficiently, making even the non-optimal conic regression solution worthwhile).

Still, if you don't find the results satisfactory, then a better approach might just be to stop worrying about the shape and use a neural network for this form of nonlinear regression. That or you could eyeball the elbow point and divide your dataset about it as you suggested in the first approach.

Upvotes: 0

How can I fit two linear functions to a set of data points?

Answers (3)

Related Questions