Reputation: 5527
I have a dataset where a process is described as a time series made of ~2000 points and 1500 dimensions.
I would like to quantify how much each dimension is correlated with another time series measured by another method.
What is the appropriate way to do this (eventually done in python) ? I have heard that Pearson is not well suited for this task, at least without data preparation. What are your thoughts about that?
Many thanks!
Upvotes: 1
Views: 2017
Reputation: 5409
A general good rule in data science is to first try the easy thing. Only when the easy thing fails should you move to something more complicated. With that in mind, here is how you would compute the Pearson correlation between each dimension and some other time series. The key function here being pearsonr
:
import numpy as np
from scipy.stats import pearsonr
# Generate a random dataset using 2000 points and 1500 dimensions
n_times = 2000
n_dimensions = 1500
data = np.random.rand(n_times, n_dimensions)
# Generate another time series, also using 2000 points
other_time_series = np.random.rand(n_times)
# Compute correlation between each dimension and the other time series
correlations = np.zeros(n_dimensions)
for dimension in range(n_dimensions):
# The Pearson correlation function gives us both the correlation
# coefficient (r) and a p-value (p). Here, we only use the coefficient.
r, p = pearsonr(data[:, dimension], other_time_series)
correlations[dimension] = r
# Now we have, for each dimension, the Pearson correlation with the other time
# series!
len(correlations)
# Print the first 5 correlation coefficients
print(correlations[:5])
If Pearson correlation doesn't work well for you, you can try swapping out the pearsonr
function with something else, like:
spearmanr
Spearman rank-order correlation coefficient.kendalltau
Kendall’s tau, a correlation measure for ordinal data.Upvotes: 3