Gabriel

Reputation: 42439

Best fit from a set of curves to data points

I have a set of curves F = {f1, f2, f3, ..., fN}, each of them defined through a set of points, i.e., I don't have the explicit form of the functions. So I have a set of N tables like so:

#f1: x  y
1.2  0.5
0.6  5.6
0.3  1.2
...

#f2: x  y
0.3  0.1
1.2  4.1
0.8  2.2
...

#fN: x  y
0.7  0.3
0.3  1.1
0.1  0.4
...

I also have a set of observed/measured data points O = [p1, p2, p3, ..., pM], where each point has x, y coordinates and a given weight in the range [0, 1], so it looks like:

#O: x  y  w
0.2  1.6  0.5
0.3  0.7  0.3
0.1  0.9  0.8
...

Since N ~ 10000 (i.e., I have a large number of functions), what I'm looking for is an efficient (more precisely: fast) way to find the curve that best fits my set of observed, weighted points O.

I know how to find a best fit with Python when I have the explicit form of the functions (scipy.optimize.curve_fit), but how do I do that when the functions are defined as tables?
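
For reference, a minimal way to load tables in this shape with numpy (the file names f1.dat ... fN.dat and obs.dat are hypothetical, just for illustration):

import numpy as np

N = 10000  # number of curve tables

# hypothetical file names; each f<i>.dat holds two columns: x, y
curves = [np.loadtxt("f%d.dat" % i) for i in range(1, N + 1)]

# obs.dat holds three columns: x, y, w
obs = np.loadtxt("obs.dat")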

Upvotes: 1

Views: 2359

Answers (3)

Oliver Dain

Reputation: 9963

Here's one possible solution. It combines some of the comments on your original post with @elyase's solution below. @elyase has provided one way to interpolate between the points you have for each function. Given that, and taking "best fit" to mean the lowest weighted sum of squared errors, I think the following does what you want:

# Here a "model" is an interpolated function as per @elyase's solution below
import sys

def find_best_model(models, data):
    """Return the model with the lowest weighted sum-of-squares score.

    data is a sequence of (x, y, weight) tuples; each model is callable.
    """
    min_score = sys.float_info.max
    best_model = None
    for model in models:
        score = 0.0
        for x, y, w in data:
            score += w * (y - model(x)) ** 2
        if score < min_score:
            min_score = score
            best_model = model
    return best_model

You mention that you need a "fast" solution. Based on your comments, doing the above for every model results in about 2 million iterations total. That shouldn't take more than a few seconds, even in Python. Is that fast enough?

If not, things get much more complex. For example, you could try to store your models (you call them functions above) in sorted order such that model1 > model2 if model1(x) > model2(x) for all x (given the interpolation above). That defines only a partial order, but it could be very helpful if your models have the right properties. Given that, you could do something akin to binary search. Alternatively, you could do a branch-and-bound search where the bound is given by the distance between the first value in data and the first value in the function; a simple variant of that bounding idea is sketched below. Depending on the nature of your functions and data, that may or may not help. There are also approaches to consider if you only need an almost-optimal rather than exactly optimal answer, and so on. In short, to go beyond the trivial answer above, we need to know more about your time constraints, the data, and the models.
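
As one concrete instance of the bounding idea (an illustrative variant, not taken from the answer above): every term of the weighted sum is non-negative, so the running partial score is a lower bound on a model's final score, and you can abandon a model as soon as that partial score exceeds the best complete score seen so far. A minimal sketch, reusing the find_best_model interface from above:

def find_best_model_early_abandon(models, data):
    """Like find_best_model, but stops scoring a model as soon as its
    partial weighted error already exceeds the best score seen so far."""
    best_score = float("inf")
    best_model = None
    for model in models:
        score = 0.0
        for x, y, w in data:
            score += w * (y - model(x)) ** 2
            if score >= best_score:
                break  # cannot beat the current best; abandon this model
        else:
            best_score = score
            best_model = model
    return best_model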

Upvotes: 1

Luke

Reputation: 11644

Here's my suggested approach:

  1. Place all functions in a single numpy array
  2. Compute the squared distance between each point in your test data and the corresponding point in each function (you could take the square root for true distances, but sqrt is expensive)
  3. Compute error as weighted sum of distances (or modify this to your liking)
  4. Find minimum error

For example:

import numpy as np

# define an array of N=3 functions
funcs = np.array([
    [[0, 1, 2, 3, 4, 5],  # x1
     [0, 1, 2, 1, 0, 0]], # y1
    [[0, 1, 2, 3, 4, 5],  # x2
     [0, 0, 0, 1, 2, 3]], # y2
    [[0, 1, 2, 3, 4, 5],  # x3
     [5, 4, 3, 2, 1, 0]]  # y3
    ], dtype=float)

# define the test data and weights with the same
# dimensions as function array
data = np.array([
    [[0, 1, 2, 3, 4, 5],  # x
     [0, 1, 2, 2, 1, 0]]  # y
    ], dtype=float)

weight = np.array([
    [0.1, 0.2, 0.3, 0, 0, 0]  # w
    ])

# compute distance between points in data and each function:
dist = ((funcs - data) ** 2).sum(axis=1)

# compute weighted error across all functions:
err = (dist * weight).sum(axis=1)

print "Errors:", err
print "Best fit:", np.argmin(err)

Upvotes: 1

elyase

Reputation: 40993

You need two elements in order to have a fit: the data (which you already have) and a model space (linear models, Gaussian processes, support vector regression). In your case the model has the additional constraint that some data points should be weighted differently than others. Maybe something like this works for you:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline

temp = np.asarray([10, 9.6, 9.3, 9.0, 8.7])
height = np.asarray([129, 145, 167, 190, 213])
f = UnivariateSpline(height, temp)

Now you can evaluate f wherever you want:

test_points = np.arange(120, 213, 5)
plt.plot(height, temp, 'o', test_points, f(test_points), 'x')
plt.show()

(Figure: the (height, temp) data points plotted as circles, with the spline evaluated at test_points plotted as crosses.)
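
To tie this back to the original question: build one interpolator per tabulated curve, then pick the curve minimizing the weighted squared error against the observations, as in the other answers. A minimal sketch follows; the helper names fit_curve and best_curve are just for illustration, the tables are assumed to be already loaded as (x, y) array pairs, and each table is sorted first because UnivariateSpline requires increasing x:

import numpy as np
from scipy.interpolate import UnivariateSpline

def fit_curve(x, y):
    # UnivariateSpline needs increasing x, so sort each table first
    order = np.argsort(x)
    return UnivariateSpline(x[order], y[order])

def best_curve(tables, obs):
    # tables: list of (x, y) array pairs, one per tabulated curve
    # obs: array of shape (M, 3) with columns x, y, w
    splines = [fit_curve(x, y) for x, y in tables]
    errors = [np.sum(obs[:, 2] * (obs[:, 1] - s(obs[:, 0])) ** 2)
              for s in splines]
    return int(np.argmin(errors))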

Upvotes: 4
