user4999547

Find the position of the lowest difference between numpy arrays

I've got two music files: a lossless one with a short gap at the beginning (at the moment it's just silence, but it could be anything: a sinusoid or just some noise) and an mp3:

In [1]: plt.plot(y[:100000])
Out[1]: 

Lossless file

In [2]: plt.plot(y2[:100000])
Out[2]: 

mp3 file

These arrays are similar but not identical, so I need to cut out this gap, i.e. find the offset at which one array first occurs in the other with the lowest delta error.

And here's my solution (5.7065 sec.):

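error = []
for i in range(25000):
    # shift y by i samples and compare the overlap with the start of y2
    y_n = y[i:100000]
    y2_n = y2[:100000-i]
    error.append(abs(y_n - y2_n).mean())  # mean absolute difference at shift i
start = np.array(error).argmin()  # shift with the lowest error

print(start, error[start])  # 23057 0.0100046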

Is there a more Pythonic way to solve this?

Edit: After calculating the mean distance between special points (e.g. where data == 0.5), I reduced the search range from 25000 to 2000 shifts. This gives me a reasonable time of 0.3871 s:

a = np.where(y[:100000].round(1) == 0.5)[0]
b = np.where(y2[:100000].round(1) == 0.5)[0]

mean = int((a - b[:len(a)]).mean())
delta = 1000

error = []
for i in range(mean - delta, mean + delta):
...

Upvotes: 1

Views: 657

Answers (3)

koffein

Reputation: 1882

I think what you are looking for is correlation. Here is a small example.

import numpy as np

equal_part = [0, 1, 2, 3, -2, -4, 5, 0]
y1 = equal_part + [0, 1, 2, 3, -2, -4, 5, 0]
y2 = [1, 2, 4, -3, -2, -1, 3, 2]+y1

np.argmax(np.correlate(y1, y2, 'same'))

Out:

7

This returns the index at which the correlation between the two signals is at its maximum. As you can see, in this example the time difference should be 8, not 7, but this depends on your data... Also note that both signals have the same length.
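One way to read the lag off directly is sketched below, assuming the same toy arrays as above. In 'same' mode the argmax is an index into a centered window rather than a lag, whereas in 'full' mode the zero lag of np.correlate(a, v) sits at index len(v) - 1. The names corr and lag are introduced here only for illustration.

```python
import numpy as np

equal_part = [0, 1, 2, 3, -2, -4, 5, 0]
y1 = np.array(equal_part + equal_part)
y2 = np.array([1, 2, 4, -3, -2, -1, 3, 2] + list(y1))

# 'full' mode evaluates every possible lag; zero lag is at index len(y1) - 1
corr = np.correlate(y2, y1, mode='full')
lag = int(np.argmax(corr)) - (len(y1) - 1)

print(lag)  # number of samples by which y1 is delayed inside y2
```

With this convention, y2[lag:] should line up with y1.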

Upvotes: 0

Finwood

Reputation: 3981

What you are trying to do is a cross-correlation of the two signals.

This can be done easily using signal.correlate from the scipy library:

import scipy.signal
import numpy as np

# limit your signal length to speed things up
lim = 25000

# do the actual correlation
corr = scipy.signal.correlate(y[:lim], y2[:lim], mode='full')

# The offset is the maximum of your correlation array,
# itself being offset by (lim - 1):
offset = np.argmax(corr) - (lim - 1)

You might want to take a look at this answer to a similar problem.
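For long signals a direct correlation can be slow, and the same full cross-correlation can also be computed through the FFT. The following is a NumPy-only sketch on synthetic data, not on the files from the question: the helper find_offset, the gap of 2300 samples, and the noise level are all assumptions made for illustration. It uses the same sign convention as the offset computed above.

```python
import numpy as np

def find_offset(a, b):
    """Return the lag k that maximizes sum_n a[n + k] * b[n]."""
    n = len(a) + len(b) - 1  # length of the full linear correlation
    # cross-correlation equals convolution of a with the reversed b
    spec = np.fft.rfft(a, n) * np.fft.rfft(b[::-1], n)
    corr = np.fft.irfft(spec, n)
    # zero lag sits at index len(b) - 1, as in mode='full'
    return int(np.argmax(corr)) - (len(b) - 1)

# synthetic stand-ins: "lossless" y has a silent gap, "mp3" y2 is slightly noisy
rng = np.random.default_rng(0)
clean = rng.standard_normal(20000)
gap = 2300  # assumed gap length
y = np.concatenate([np.zeros(gap), clean])
y2 = clean + 0.05 * rng.standard_normal(clean.size)

offset = find_offset(y, y2)
aligned = y[offset:]  # y with the leading gap cut off
print(offset, np.abs(aligned - y2).mean())
```

Because the whole correlation is computed in one pass, there is no need to truncate the signals to a short window that might barely contain the overlap.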

Upvotes: 2

Boris Gorelik

Reputation: 31767

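Let's generate some data first:

import numpy as np
import matplotlib.pyplot as plt

N = 1000
y1 = np.random.randn(N)
y2 = y1 + np.random.randn(N) * 0.05  # y1 plus a little noise
y2[0:int(N / 10)] = 0  # blank out the first 10% of y2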

In this data, y1 and y2 are almost the same (note the small added noise), but the first 10% of y2 is zeroed out (similar to your example).

We can now calculate the absolute difference between the two vectors and find the first element for which the absolute difference is below a sensitivity threshold:

abs_delta = np.abs(y1 - y2)
THRESHOLD = 1e-2
sel = abs_delta < THRESHOLD
ix_start = np.where(sel)[0][0]


fig, axes = plt.subplots(3, 1)
ax = axes[0]
ax.plot(y1, '-')
ax.set_title('y1')
ax.axvline(ix_start, color='red')
ax = axes[1]
ax.plot(y2, '-')
ax.axvline(ix_start, color='red')
ax.set_title('y2')

ax = axes[2]
ax.plot(abs_delta)
ax.axvline(ix_start, color='red')
ax.set_title('abs diff')

sample data plotted

This method works if the overlapping parts are indeed "almost identical". You will need a smarter alignment method if the similarity is low.
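One possible step in that direction, sketched under assumptions: a single sample in the blank region can fall below the threshold by chance, so the absolute difference can be averaged over a short window before thresholding. The window length and threshold below are illustrative guesses, not tuned values, and the data is the same kind of synthetic data as above with a fixed seed.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
y1 = rng.standard_normal(N)
y2 = y1 + rng.standard_normal(N) * 0.05
y2[:N // 10] = 0  # first 10% of y2 is blank

abs_delta = np.abs(y1 - y2)

# average over a short window so one lucky sample cannot trigger a match
WINDOW = 20      # assumed window length
THRESHOLD = 0.1  # assumed threshold on the windowed mean
smoothed = np.convolve(abs_delta, np.ones(WINDOW) / WINDOW, mode='valid')

# smoothed[i] is the mean of abs_delta[i:i + WINDOW]; take the first window below threshold
ix_start = int(np.argmax(smoothed < THRESHOLD))
print(ix_start)
```

Since a window can straddle the boundary, the detected index may land a few samples before the true start of the matching region.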

Upvotes: 0
