Generate simulated data in Python while meeting a range of correlations with respect to a predefined variable

Question

Let refVar denote a variable of interest that contains experimental data. For a simulation study, I would like to generate other variables V0.05, V0.10, V0.15, ..., V0.95, where the number following V is the correlation between that variable and refVar (so the columns are easy to track in the final dataframe). My reading led me to multivariate_normal() from numpy. However, that function generates two 1D arrays that are both filled with random numbers. What I want is to keep refVar fixed and generate other arrays of random numbers that meet the specified correlation with it. Please find my code below. In short, I have no clue how to generate other variables relative to my experimental variable refVar. Ideally, I would like to build a data frame containing the following columns: refVar, V0.05, V0.10, ..., V0.95. I hope you get my point, and thank you in advance for your time.

import numpy as np
import pandas as pd
from numpy.random import multivariate_normal as mvn

refVar = [75.25,77.93,78.2,61.77,80.88,71.95,79.88,65.53,85.03,61.72,60.96,56.36,23.16,73.36,64.18,83.07,63.25,49.3,78.2,30.96]
mean_refVar = np.mean(refVar)
for r in np.arange(0,1,0.05):
    var1 = 1
    var2 = 1
    cov = r 
    cov_matrix = [[var1,cov],
                 [cov,var2]]
    data = mvn([mean_refVar,mean_refVar],cov_matrix,size=len(refVar))
    output = 'corr_'+str(r.round(2))+'.txt'
    df = pd.DataFrame(data,columns=['refVar','v'+str(r.round(2))])
    df.to_csv(output,sep='	',index=False) # Ideally, instead of creating an output for each correlation, I would like to generate a DF with refVar and all these newly created Series

Quang Hoang · Accepted Answer

Following this answer, we can generate the sequence as follows:

def rand_with_corr(refVar, corr):
    # center and normalize refVar
    X = np.array(refVar) - np.mean(refVar)
    X = X/np.linalg.norm(X)

    # random sampling Y
    Y = np.random.rand(len(X))
    # center Y
    Y = Y - Y.mean()
    
    # find the orthogonal component to X
    Y = Y - Y.dot(X) * X
    
    # normalize Y
    Y = Y/np.linalg.norm(Y)


    # output
    return Y + (1/np.tan(np.arccos(corr))) * X

# test
out = rand_with_corr(refVar, 0.05)

pd.Series(out).corr(pd.Series(refVar))
# out
# 0.050000000000000086
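To get the single data frame asked for in the question (refVar, V0.05, ..., V0.95), the function above can be called once per target correlation. The following is a minimal sketch rather than a definitive implementation: the `rng` parameter, the `targets` name, and the use of `np.random.default_rng` with a fixed seed (in place of `np.random.rand`) are assumptions added here for reproducibility. The function is repeated so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def rand_with_corr(ref, corr, rng):
    # center and normalize the reference variable
    x = np.asarray(ref, dtype=float)
    x = x - x.mean()
    x = x / np.linalg.norm(x)
    # random vector, centered, with its component along x removed
    y = rng.random(len(x))
    y = y - y.mean()
    y = y - y.dot(x) * x
    y = y / np.linalg.norm(y)
    # mix so that the angle between the result and x is arccos(corr)
    return y + x / np.tan(np.arccos(corr))

refVar = [75.25, 77.93, 78.2, 61.77, 80.88, 71.95, 79.88, 65.53, 85.03, 61.72,
          60.96, 56.36, 23.16, 73.36, 64.18, 83.07, 63.25, 49.3, 78.2, 30.96]

rng = np.random.default_rng(0)  # assumption: seeded only for reproducibility
# 0.05, 0.10, ..., 0.95; linspace avoids the floating-point endpoint issues of arange
targets = np.round(np.linspace(0.05, 0.95, 19), 2)

columns = {"refVar": refVar}
for r in targets:
    columns[f"V{r:.2f}"] = rand_with_corr(refVar, r, rng)

df = pd.DataFrame(columns)
```

The generated columns are centered on zero and are not on the same scale as refVar. Since the Pearson correlation is unchanged by adding a constant or multiplying by a positive factor, each column can be shifted and rescaled afterwards if a particular mean or standard deviation is needed.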
