Reputation: 475
I have two arrays of data that I want to cross-correlate to get the length of the delay (if there is one) between them, and then normalize the result between 0 and 1. For example:
import numpy as np
x = [0,1,1,1,2,0,0]
y = [0,0,0,1,1,1,2]
corr = np.correlate(x, y, 'full')
norm = np.linalg.norm
normalized = corr / (norm(x) * norm(y))
returns:
[0.0, 0.0, 0.29, 0.43, 0.57, 1.0, 0.57, 0.43, 0.29, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
The problem is: I need to correlate two graphs whose x values are not regularly spaced (and are not the same for the two arrays; each y value is simply linked to some x value), so I interpolate the data with scipy.interpolate.interp1d before the correlation, and this results in NaN entries in my arrays.
At this point the correlation function only returns NaN.
For example:
import numpy as np
x = [0,1,1,1,2,0,np.nan]
y = [np.nan,0,0,1,1,1,2]
corr = np.correlate(x, y, 'full')
norm = np.linalg.norm
normalized = corr / (norm(x) * norm(y))
returns:
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]
I finally understood that I get this because norm(x) returns NaN. My question is: how can I just ignore those NaN values, or is there a better way to cross-correlate two arrays?
I already tried running interp1d with fill_value='extrapolate', but it causes problems in the correlation calculation. Is there another value that I can pass to fill_value that will "ignore" the missing values in the data?
Also, np.correlate(x,y) returns NaN, but if we look at np.correlate(x,y,'full') it actually returns [ 0. 0. 2. 3. 4. 7. 4. nan nan nan nan nan nan nan nan]. Why is numpy taking NaN as the maximum value?
Upvotes: 2
Views: 3454
Reputation: 3158
First of all, replace the NaN values, for example with the mean or mode of the remaining elements. This is the most naive technique; how best to handle NaN values is an entirely separate question. You can use np.nanmean() for that purpose.
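A minimal sketch of that naive fill, using the arrays from the question. The helper name fill_nan_with_mean is hypothetical and introduced only for illustration. Note that filling with the mean changes the signal, so the resulting correlation values are only approximate.

```python
import numpy as np

x = np.array([0, 1, 1, 1, 2, 0, np.nan])
y = np.array([np.nan, 0, 0, 1, 1, 1, 2])

def fill_nan_with_mean(a):
    """Hypothetical helper: copy of `a` with NaN replaced by the mean of the other entries."""
    a = np.asarray(a, dtype=float)
    # np.nanmean ignores NaN entries when computing the mean
    return np.where(np.isnan(a), np.nanmean(a), a)

x_filled = fill_nan_with_mean(x)
y_filled = fill_nan_with_mean(y)

# With no NaN left, the normalization from the question works again
corr = np.correlate(x_filled, y_filled, 'full')
normalized = corr / (np.linalg.norm(x_filled) * np.linalg.norm(y_filled))
```

Since every entry is non-negative here, the normalized values stay between 0 and 1.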
Numpy's correlate is not what you are looking for.
From the documentation:
Cross-correlation of two 1-dimensional sequences.
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
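To make the index k in that definition concrete, here is a small sketch using the NaN-free arrays from the question. It assumes the 'full' output covers lags from -(len(v) - 1) to len(a) - 1, which follows from the formula above. The names lags and best_lag are illustrative only.

```python
import numpy as np

x = np.array([0, 1, 1, 1, 2, 0, 0], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 2], dtype=float)

# 'full' mode returns one value per possible shift of y against x
corr = np.correlate(x, y, 'full')

# Lag k for each output sample: -(len(y) - 1), ..., 0, ..., len(x) - 1
lags = np.arange(-(len(y) - 1), len(x))

# The lag with the largest correlation is the estimated delay
best_lag = lags[np.argmax(corr)]
print(best_lag)  # -2: x[n - 2] lines up with y[n], i.e. y is x delayed by 2 samples
```

A NaN anywhere in the inputs propagates into every output sample whose sum touches it, so this only gives a usable peak after the NaN entries have been filled or removed.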
You should rather look at Pearson correlation coefficient, which is a measure of the linear correlation between two variables X and Y.
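import numpy as np
from scipy.stats import pearsonr
x = [0,1,1,1,2,0,np.nan]
y = [np.nan,0,0,1,1,1,2]
# pearsonr returns the coefficient and a p-value; it has no 'full' mode.
# The NaN entries must be filled or masked first, otherwise the result is nan.
corr, p_value = pearsonr(x, y)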
or you might also use numpy.corrcoef(x,y), which returns a 2D array of the correlation coefficients between two (or more) arrays.
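As a rough sketch of how numpy.corrcoef could be applied to the arrays from the question, the snippet below keeps only the positions where both arrays hold a valid number. Masking is an assumption made here instead of filling, and the variable names are illustrative.

```python
import numpy as np

x = np.array([0, 1, 1, 1, 2, 0, np.nan])
y = np.array([np.nan, 0, 0, 1, 1, 1, 2])

# Keep only the positions where neither array is NaN
valid = ~np.isnan(x) & ~np.isnan(y)

# corrcoef returns a 2x2 matrix for two inputs:
# the diagonal is 1, the off-diagonal entries are the Pearson coefficient
matrix = np.corrcoef(x[valid], y[valid])
r = matrix[0, 1]
```

Note that this gives a single coefficient for the arrays as aligned, not a value per lag, so it does not by itself estimate a delay.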
Upvotes: 1