find best match numeric sequence in sequence array

Assume I have these two arrays:

float arr[] = {40.4357,40.6135,40.2477,40.2864,39.3449,39.8901,40.103,39.9959,39.7863,39.9102,39.2652,39.2688,39.5147,38.2246,38.5376,38.4512,38.9951,39.0999,39.3057,38.53,38.2761,38.1722,37.8816,37.6521,37.8306,38.0853,37.9644,38.0626,38.0567,38.3518,38.4044,38.3553,38.4978,38.3768,38.2058,38.3175,38.3123,38.262,38.0093,38.3685,38.0111,38.4539,38.8122,39.1413,38.9409,39.2043,39.3538,39.4123,39.3628,39.2825,39.1898,39.0431,39.0634,38.5993,38.252,37.3793,36.6334,36.4009,35.2822,34.4262,34.2119,34.1552,34.3325,33.9626,33.2661,32.3819,35.1959,36.7602,37.9039,37.8103,37.5832,37.9718,38.3111,38.9323,38.6763,39.1163,38.8469,39.805,40.2627,40.3689,40.4064,40.0558,40.815,41.0234,41.0128,41.0296,41.0927,40.7046,40.6775,40.2711,40.1283,39.7518,40.0145,40.0394,39.8461,39.6317,39.5548,39.1996,38.9861,38.8507,38.8603,38.483,38.4711,38.4214,38.4286,38.5766,38.7532,38.7905,38.6029,38.4635,38.1403,36.6844,36.616,36.4053,34.7934,34.0226,33.0505,33.4978,34.6106,35.284,35.7535,35.3541,35.5481,35.4086,35.7096,36.0526,36.1222,35.9408,36.1007,36.7952,36.99,37.1024,37.0993,37.3144,36.6951,37.1213,38.0026,38.1266,39.2538,38.8963,39.0158,38.6235,38.7908,38.6041,38.4489,38.3207,37.7398,38.5304,38.925,38.7249,38.9221,39.1704,39.5113,40.0613,39.3602,39.8689,39.973,40.0524,40.0025,40.7584,40.9714,40.9106,40.9685,40.6554,39.7314,39.0044,38.7183,38.5163,38.6101,38.2004,38.7606,38.7532,37.8903,37.8403,38.5368,39.0462,38.8279,39.0748,39.2907,38.5447,38.423,38.5624,38.476,38.5784,39.0905,39.379,39.4739,39.5774,40.7036,40.3044,39.6162,39.9967,40.0562,39.3426,38.666,38.7561,39.2823,38.8548,37.6214,37.8188,38.1086,38.3619,38.5472,38.1357,38.1422,37.95,37.1837,37.4636,36.8852,37.1617,37.5051,37.7724,38.0879,37.7197,38.0422,37.8551,38.5688,38.8388};
float pattern[] = {38.6434,38.1409,37.3391,37.5457,37.7487,37.7499,37.6121,37.4789,37.5821,37.6541,38.0365,37.7907,37.9932,37.9945,37.7032,37.3556,37.6359,37.5412,37.5296,37.8829,38.3797,38.4452,39.0929,39.1233,39.3014,39.0317,38.903,38.8221,39.045,38.6944,39.0699,39.0978,38.9877,38.8123,38.7491,38.5888,38.7875,38.2086,37.7484,37.3961,36.8663,36.2607,35.8838,35.3297,35.5574,35.7239};

Ives uploaded this example graph:

As you can see in the graph pattern almost fits in the array at index 17

Whats the best and fastest way to find this index? And is there a way to have a confidence for there match fuse the values are not equal as you can see?

Upvotes: 3

Answers (3)

tobias_k

Reputation: 82899

If the starting index is your only degree of freedom, you can just try each index and calculate the sum of squared errors for each of the data points. In Python this could look like this:

data = [40.4357,40.6135,40.2477,...]
pattern = [38.6434,38.1409,37.3391,37.5457,37.7487,...]

best_ind, best_err = 0, 1e9999
for i in range(len(data) - len(pattern)):
    subdata = data[i : i + len(pattern)]
    err = sum((d-p)**2 for (d, p) in zip(subdata, pattern))
    if err < best_err:
        best_ind, best_err = i, err

Result:

>>> print best_ind, best_err
17 21.27929269

Upvotes: 1

usual me

Reputation: 8778

It's a one-liner in Python, using the fact that tuples are sorted lexicographically:

In [1]:

import numpy as np
arr = np.array( [ 40.4357,40.6135,40.2477,40.2864,39.3449,39.8901,40.103,39.9959,39.7863,39.9102,39.2652,39.2688,39.5147,38.2246,38.5376,38.4512,38.9951,39.0999,39.3057,38.53,38.2761,38.1722,37.8816,37.6521,37.8306,38.0853,37.9644,38.0626,38.0567,38.3518,38.4044,38.3553,38.4978,38.3768,38.2058,38.3175,38.3123,38.262,38.0093,38.3685,38.0111,38.4539,38.8122,39.1413,38.9409,39.2043,39.3538,39.4123,39.3628,39.2825,39.1898,39.0431,39.0634,38.5993,38.252,37.3793,36.6334,36.4009,35.2822,34.4262,34.2119,34.1552,34.3325,33.9626,33.2661,32.3819,35.1959,36.7602,37.9039,37.8103,37.5832,37.9718,38.3111,38.9323,38.6763,39.1163,38.8469,39.805,40.2627,40.3689,40.4064,40.0558,40.815,41.0234,41.0128,41.0296,41.0927,40.7046,40.6775,40.2711,40.1283,39.7518,40.0145,40.0394,39.8461,39.6317,39.5548,39.1996,38.9861,38.8507,38.8603,38.483,38.4711,38.4214,38.4286,38.5766,38.7532,38.7905,38.6029,38.4635,38.1403,36.6844,36.616,36.4053,34.7934,34.0226,33.0505,33.4978,34.6106,35.284,35.7535,35.3541,35.5481,35.4086,35.7096,36.0526,36.1222,35.9408,36.1007,36.7952,36.99,37.1024,37.0993,37.3144,36.6951,37.1213,38.0026,38.1266,39.2538,38.8963,39.0158,38.6235,38.7908,38.6041,38.4489,38.3207,37.7398,38.5304,38.925,38.7249,38.9221,39.1704,39.5113,40.0613,39.3602,39.8689,39.973,40.0524,40.0025,40.7584,40.9714,40.9106,40.9685,40.6554,39.7314,39.0044,38.7183,38.5163,38.6101,38.2004,38.7606,38.7532,37.8903,37.8403,38.5368,39.0462,38.8279,39.0748,39.2907,38.5447,38.423,38.5624,38.476,38.5784,39.0905,39.379,39.4739,39.5774,40.7036,40.3044,39.6162,39.9967,40.0562,39.3426,38.666,38.7561,39.2823,38.8548,37.6214,37.8188,38.1086,38.3619,38.5472,38.1357,38.1422,37.95,37.1837,37.4636,36.8852,37.1617,37.5051,37.7724,38.0879,37.7197,38.0422,37.8551,38.5688,38.8388] )
pattern = np.array( [ 38.6434,38.1409,37.3391,37.5457,37.7487,37.7499,37.6121,37.4789,37.5821,37.6541,38.0365,37.7907,37.9932,37.9945,37.7032,37.3556,37.6359,37.5412,37.5296,37.8829,38.3797,38.4452,39.0929,39.1233,39.3014,39.0317,38.903,38.8221,39.045,38.6944,39.0699,39.0978,38.9877,38.8123,38.7491,38.5888,38.7875,38.2086,37.7484,37.3961,36.8663,36.2607,35.8838,35.3297,35.5574,35.7239 ] )

min( ( ( ( arr[i:i+len(pattern)] - pattern ) ** 2 ).mean(), i ) for i in xrange(len(arr)-len(pattern)) )

Out[5]:
(0.46259331934782588, 17)

where 0.46 is the minimal mean squared error, and 17 is the position of the minimum in arr.

Upvotes: 0

4pie0

Reputation: 29724

The straightforward algorithm is to chose a measure of convergence ( how you describe similarity, this might be average of errors, or their squared values or any other function suitable for your purposes) and apply steps

Let i = 0 be an integer index, M be a container of size = length(data) - length(pattern) + 1 to store the measurings
If i < size then shift your pattern by i, otherwise go to step 5
Calculate the measure of similarity and store into M
i = i + 1, go to 2 and repeat
Chose the index of minimum value in M

Upvotes: 0

find best match numeric sequence in sequence array

Answers (3)

Related Questions