curious_cosmo

Reputation: 1214

How can I fit two linear functions to a set of data points?

I have a set of data points that look roughly like a straight line with a bend near the beginning. See the image below, which shows the points with a single line of best fit (fit to the whole data set).

Plot with one line of best fit

Instead, they could be described by two linear functions (a line through the leftmost set of points and a separate line through the rest of the data points). What these points actually correspond to is neutron decay data from a sample containing two different isotopes.

I don't know which points correspond to which isotope, so I need to somehow make a best guess. The curve for one isotope will be a straight line, and the curve for the other will be a different straight line. How can I fit two different lines of best fit (linear) to the set of data points, such that the fitting for both is optimized?

One idea I had is to pick a "cutoff point", say at t=100 (x-axis), fit the points on the left to one line and the points on the right to a different line, and compute chi^2 for both lines to measure the goodness of the fits. I could then repeat this with slightly different cutoff points until I find the pair of lines that gives the best overall fit.
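A rough sketch of that scan (assuming the data live in numpy arrays t and y; the arrays below are just made-up stand-ins) would look something like this:

import numpy as np

# made-up stand-in data: two noisy line segments meeting near t = 100
rng = np.random.default_rng(0)
t = np.linspace(0, 300, 150)
y = np.where(t < 100, 10.0 - 0.05 * t, 7.0 - 0.02 * t) + rng.normal(0, 0.1, t.size)

best = None
for cutoff in t[2:-2]:                      # keep at least two points on each side
    left, right = t < cutoff, t >= cutoff
    pl = np.polyfit(t[left], y[left], 1)    # slope and intercept of the left line
    pr = np.polyfit(t[right], y[right], 1)  # slope and intercept of the right line
    # unweighted sum of squared residuals as a stand-in for chi^2
    # (add 1/sigma^2 weights if the points have known uncertainties)
    err = np.sum((y[left] - np.polyval(pl, t[left])) ** 2) \
        + np.sum((y[right] - np.polyval(pr, t[right])) ** 2)
    if best is None or err < best[0]:
        best = (err, cutoff, pl, pr)

print('best cutoff:', best[1])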

A different idea, which seems more involved, is to describe these data points as a combination of the two lines, y = m1*t + m2*t + b1 + b2, where the m's are the slopes and the b's are the y-intercepts. Taking the derivative of the total curve would then give dy/dt = m1 + m2. Perhaps I could cycle through different cutoff points and fit lines until I find a combination where the derivative is closest to m1 + m2. But I'm not sure how to do this, since I'm not working with one curve initially, just a bunch of discrete data points.

What would be the best way to go about this optimization of two fits in Python?

Upvotes: 6

Views: 8668

Answers (3)

James Phillips

Reputation: 4647

Here is an example of fitting two straight lines to one data set, with the crossover point between the two lines also fitted as a parameter. This example uses scipy's Differential Evolution (DE) genetic algorithm to determine initial parameter estimates. The scipy implementation of DE uses the Latin Hypercube algorithm to ensure a thorough search of parameter space, and this algorithm requires bounds within which to search - in this example those bounds are taken from the data maximum and minimum values.

import numpy
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings

xData = numpy.array([19.1647, 18.0189, 16.9550, 15.7683, 14.7044, 13.6269, 12.6040, 11.4309, 10.2987, 9.23465, 8.18440, 7.89789, 7.62498, 7.36571, 7.01106, 6.71094, 6.46548, 6.27436, 6.16543, 6.05569, 5.91904, 5.78247, 5.53661, 4.85425, 4.29468, 3.74888, 3.16206, 2.58882, 1.93371, 1.52426, 1.14211, 0.719035, 0.377708, 0.0226971, -0.223181, -0.537231, -0.878491, -1.27484, -1.45266, -1.57583, -1.61717])
yData = numpy.array([0.644557, 0.641059, 0.637555, 0.634059, 0.634135, 0.631825, 0.631899, 0.627209, 0.622516, 0.617818, 0.616103, 0.613736, 0.610175, 0.606613, 0.605445, 0.603676, 0.604887, 0.600127, 0.604909, 0.588207, 0.581056, 0.576292, 0.566761, 0.555472, 0.545367, 0.538842, 0.529336, 0.518635, 0.506747, 0.499018, 0.491885, 0.484754, 0.475230, 0.464514, 0.454387, 0.444861, 0.437128, 0.415076, 0.401363, 0.390034, 0.378698])


def func(xArray, breakpoint, slopeA, offsetA, slopeB, offsetB):
    # piecewise linear model: line A to the left of the breakpoint, line B to the right
    return numpy.where(xArray < breakpoint,
                       slopeA * xArray + offsetA,
                       slopeB * xArray + offsetB)


# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore") # do not print warnings by genetic algorithm
    val = func(xData, *parameterTuple)
    return numpy.sum((yData - val) ** 2.0)


def generate_Initial_Parameters():
    # min and max used for bounds
    maxX = max(xData)
    minX = min(xData)
    maxY = max(yData)
    minY = min(yData)
    slope = 10.0 * (maxY - minY) / (maxX - minX) # times 10 for safety margin

    parameterBounds = []
    parameterBounds.append([minX, maxX]) # search bounds for breakpoint
    parameterBounds.append([-slope, slope]) # search bounds for slopeA
    parameterBounds.append([minY, maxY]) # search bounds for offsetA
    parameterBounds.append([-slope, slope]) # search bounds for slopeB
    parameterBounds.append([minY, maxY]) # search bounds for offsetB


    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x

# by default, differential_evolution polishes its best result with L-BFGS-B, staying inside the parameter bounds
geneticParameters = generate_Initial_Parameters()

# now refine with curve_fit, seeding it with the genetic algorithm's estimates (no bounds passed)
fittedParameters, pcov = curve_fit(func, xData, yData, geneticParameters)
print('Parameters:', fittedParameters)
print()

modelPredictions = func(xData, *fittedParameters) 

absError = modelPredictions - yData

SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))

print()
print('RMSE:', RMSE)
print('R-squared:', Rsquared)

print()


##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(xData, yData,  'D')

    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)

    # now the model as a line plot
    axes.plot(xModel, yModel)

    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label

    plt.show()
    plt.close('all') # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

Upvotes: 1

Felix

Reputation: 1905

This can be interpreted as a time-series segmentation problem in combination with linear regression. There are multiple approaches to tackle this problem. One of these you already mentioned: a manually selected point where to segment the data, another one is trying to minimize the error.

First I try to recreate the data:

import numpy as np
import matplotlib.pyplot as plt
y1 = np.linspace(5.5, 3.7, num=100)
y1 = y1 + np.random.rand(y1.shape[0]) * np.linspace(0, .3, num=y1.shape[0])
y2 = np.linspace(3.7, 1.1, num=500)
y2 = y2 + np.random.rand(y2.shape[0]) * np.linspace(0.3, 1.9, num=y2.shape[0])
y = np.append(y1, y2)
x = np.array(range(len(y)))

Then I do two separate linear fits using numpy.linalg.lstsq, which solves the least-squares problem directly:

x1 = x[:100]
y1 = y[:100]
A1 = np.vstack([x1, np.ones(len(x1))]).T
m1, c1 = np.linalg.lstsq(A1, y1, rcond=None)[0]

x2 = x[100:]
y2 = y[100:]
A2 = np.vstack([x2, np.ones(len(x2))]).T
m2, c2 = np.linalg.lstsq(A2, y2, rcond=None)[0]

Plotting this results in the following image:

plt.scatter(x, y)
plt.plot(x1, m1 * x1 + c1, 'r')
plt.plot(x2, m2 * x2 + c2, 'r')
plt.show()

Plot of the generated data with the two fitted lines

You could now use an automatic segmentation algorithm like SWAB to replace the [100:] and [:100] slices, but I'd advise against it if you can decide manually where to split the data. Have a look at the link provided if you're looking for an implementation.
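If you do want the split chosen automatically but don't need a full segmentation algorithm, a brute-force search over the split index works as a simple sketch of the idea (reusing the x and y arrays from above; this is not SWAB itself):

best_split, best_err = None, np.inf
for i in range(3, len(x) - 3):
    A1 = np.vstack([x[:i], np.ones(i)]).T
    A2 = np.vstack([x[i:], np.ones(len(x) - i)]).T
    p1 = np.linalg.lstsq(A1, y[:i], rcond=None)[0]
    p2 = np.linalg.lstsq(A2, y[i:], rcond=None)[0]
    # combined sum of squared residuals of the two fits
    err = np.sum((A1 @ p1 - y[:i]) ** 2) + np.sum((A2 @ p2 - y[i:]) ** 2)
    if err < best_err:
        best_split, best_err = i, err

print('best split index:', best_split)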

Upvotes: 6

Aayush Mahajan

Reputation: 4033

Formally, your second approach amounts to fitting the data to a conic section. A pair of straight lines (POSL) is a type of conic section, and hence can be represented by the general conic equation ax^2 + by^2 + 2hxy + 2gx + 2fy + c = 0 (rather than the equation in your question, which just collapses to a single straight line of slope m1+m2). To clarify, there certainly is an equation of this form that, when plotted, gives a POSL fitting your data. Whether your algorithm can converge to it is a different story.

You can try to use regression methods to find the coefficients a, b, h, g, f and c. Ideally, the coefficients you get would form a POSL that fits your dataset closely.
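For illustration, one possible way to estimate those coefficients (not necessarily what this answer has in mind) is homogeneous least squares via an SVD; a minimal sketch with made-up data:

import numpy as np

# made-up data roughly shaped like two line segments
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 200)
y = np.where(x < 3.0, 5.0 - 0.2 * x, 5.3 - 0.3 * x) + rng.normal(0.0, 0.02, x.size)

# design matrix for a*x^2 + b*y^2 + 2h*x*y + 2g*x + 2f*y + c = 0
D = np.column_stack([x**2, y**2, 2*x*y, 2*x, 2*y, np.ones_like(x)])

# homogeneous least squares: the right singular vector belonging to the
# smallest singular value minimises ||D v|| subject to ||v|| = 1
a, b, h, g, f, c = np.linalg.svd(D)[2][-1]

# the conic degenerates into a pair of straight lines exactly when this
# determinant vanishes
degeneracy = np.linalg.det(np.array([[a, h, g],
                                     [h, b, f],
                                     [g, f, c]]))
print('coefficients a, b, h, g, f, c:', a, b, h, g, f, c)
print('degeneracy (0 for a pair of lines):', degeneracy)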

If you decide to implement this, keep in mind that the general equation can represent many shapes (parabola, hyperbola, ellipse, etc.). The regression may get stuck or converge to one of those other shapes instead. You can try to nudge it towards a POSL by rewarding adherence to the degeneracy condition a pair of straight lines must satisfy (the determinant of the conic's coefficient matrix vanishes, i.e. abc + 2fgh - af^2 - bg^2 - ch^2 = 0), but this might overcomplicate things.

This method is very neat mathematically, but I'd wager that getting your trained model to converge to a POSL is like balancing something on the edge of a knife, since the POSL condition is very narrow. Most probably it would end up giving you the equation of a parabola, ellipse, or hyperbola (which might still fit your dataset well enough to make even the non-optimal conic solution worthwhile).

Still, if you don't find the results satisfactory, a better approach might be to stop worrying about the shape and use a neural network for this kind of nonlinear regression. Alternatively, you could eyeball the elbow point and split your dataset there, as you suggested in your first approach.
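As a minimal sketch of that neural-network fallback (assuming scikit-learn is available; the data and layer sizes below are arbitrary placeholders, not a recommendation):

import numpy as np
from sklearn.neural_network import MLPRegressor

# arbitrary stand-in data shaped like two noisy line segments
rng = np.random.default_rng(2)
t = np.linspace(0.0, 300.0, 300)
y = np.where(t < 100.0, 10.0 - 0.05 * t, 7.0 - 0.02 * t) + rng.normal(0.0, 0.1, t.size)

# scikit-learn expects a 2-D feature matrix, hence the reshape
model = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=5000, random_state=0)
model.fit(t.reshape(-1, 1), y)
y_smooth = model.predict(t.reshape(-1, 1))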

Upvotes: 0
