Reputation: 21
The problem:
I want to be able to sample corelated data, defined by a given correlation matrix, from multiple distributions. Regarding the distributions I want to make as little assumptions as possible. It should especially work for non-continuous distributions and continuous distributions alike. (Mean 0 and Variance 1 can of course be assumed) The correlations do not need to be the exact correlations, it is sufficient to have a decent approximation. I do not know the joint distribution, but if one could be created given the single distributions and the correlation matrix, this should solve my problem.
What I tried:
I tried to draw from the distributions first and try to find rearrangements that matched the desired correlations better and better. My starting point was the current best solution and the I shuffled a decreasingly small part of the samples and checked if the result was a better match. If so, this became my new starting point. But the results were rather disappointing. So there are probably better ways to do it.
My code (not providing the desired results):
import numpy as np
import scipy
import random
#the desired correlation matrix
des_cov1=np.array([[1,0.5,-0.2],[0.5,1,0.2],[-0.2,0.2,1]])
#data drawn from 3 different distributions
data1=np.array([np.random.normal(loc=-3.0, scale=1.0, size=1000), np.random.binomial(1000, 0.2, size=1000),np.random.normal(loc=5.0, scale=7.0, size=1000)])
#A helper function foudn on stackoverflow to shuffle part of the data
#shuffle a given percentage of the data (on average)
def pashuffle(data, perc=10):
for index, letter in enumerate(data):
if random.randrange(0, 100) < perc/2:
new_index = random.randrange(0, len(data))
data[index], data[new_index] = data[new_index], data[index]
return data
#Try to correlate the data accoridng to a desired covariation matrix
#shuffle a given percentage of the data (on average)
def pashuffle(data, perc=10):
for index, letter in enumerate(data):
if random.randrange(0, 100) < perc/2:
new_index = random.randrange(0, len(data))
data[index], data[new_index] = data[new_index], data[index]
return data
#Try to correlate the data accoridng to a desired covariation matrix
def correlate_data_sets(data, des_cov, nr_trials=100, perc=10, decrease_factor=0.999):
data=np.array(data, dtype=float)
#save means and variance to reconsturct the old data
means= np.mean(data,axis=1)
standard_deviation= scipy.stats.tstd(data, axis=1)
for i in range(len(data)):
bd_mean=means[i]
bd_std=standard_deviation[i]
#normalize data
data[i]=[(j-bd_mean)/bd_std for j in data[i]]
best_data=np.array(data)
#initialize best deviation with a value that will for sure be beat
best_deviation=4*len(des_cov)*len(des_cov)
#try to find a good solution by starting with the current best
for i in range(nr_trials):
#shuffle a part of the data, to get a posibly better solution
for j in range(len(data)-1):
data[j]=pashuffle(best_data[j],perc)
#calulate the squared deviation
deviation_matrix=np.cov(data)-des_cov
squared_cov_deviation=sum(sum(deviation_matrix*deviation_matrix))
#replace current data arrangement if new arragnement is better
if squared_cov_deviation < best_deviation:
best_data=data.copy()
best_deviation=squared_cov_deviation
#decrease the amount of data to shuffle
perc*=decrease_factor
correlations=np.cov(best_data)
#transform data back
for i in range(len(data)):
bd_mean=means[i]
bd_std=standard_deviation[i]
best_data[i]=[(j*bd_std+bd_mean) for j in best_data[i]]
print(correlations)
return(best_data)
#Apply the function on the data
correlate_data_sets(data=data1, des_cov=des_cov1)
Upvotes: 1
Views: 1077
Reputation: 18625
This is a tricky problem, but you can do it by (1) find the spearman rank correlation you need, (2) generate values from a uniform distribution with this pair-wise correlation, then (3) use the values from this sample as ranks in your arbitrary distributions, to generate values from those distributions. See my paper using this technique at http://ee.hawaii.edu/~mfripp/papers/Fripp_2011_Wind_Reserves.pdf (section 2.2).
If you need more than the right pair-wise rank correlation, you may be able to do it by generating uniformly distributed tuples (one element for each random variable), then using some technique to nudge them into the right correlation structure, then use them as ranks for the arbitrary distributions. That is in the area of copula methods.
Upvotes: 1