Reputation: 321
I can generate a binary variable y
as follows:
clear
set more off
gen y =.
replace y = rbinomial(1, .5)
How can I generate n
variables y_1, y_2, ..., y_n
with a correlation of rho
?
Upvotes: 0
Views: 536
Reputation:
This is @pjs's solution in Stata for generating a time-series:
clear
set seed 12345
set obs 1
local p = 0.7
local rho = 0.5
generate y = runiform()
if y <= `p' replace y = 1
else replace y = 0
forvalues i = 1 / 99999 {
set obs `= _N + 1'
local rnd = runiform()
if y[`i'] == 1 {
if `rnd' <= `p' + `rho' * (1 - `p') replace y = 1 in `= `i' + 1'
else replace y = 0 in `= `i' + 1'
}
else {
if `rnd' <= `p' * (1 - `rho') replace y = 1 in `= `i' + 1'
else replace y = 0 in `= `i' + 1'
}
}
Results:
summarize y
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
y | 100,000 .70078 .4579186 0 1
generate id = _n
tsset id
corrgram y, lags(5)
-1 0 1 -1 0 1
LAG AC PAC Q Prob>Q [Autocorrelation] [Partial Autocor]
-------------------------------------------------------------------------------
1 0.5036 0.5036 25366 0.0000 |---- |----
2 0.2567 0.0041 31955 0.0000 |-- |
3 0.1273 -0.0047 33576 0.0000 |- |
4 0.0572 -0.0080 33903 0.0000 | |
5 0.0277 0.0032 33980 0.0000 | |
Upvotes: 2
Reputation:
This is @pjs's solution in Stata for generating pairs of variables:
clear
set obs 100
set seed 12345
generate x = rbinomial(1, 0.7)
generate y = rbinomial(1, 0.7 + 0.2 * (1 - 0.7)) if x == 1
replace y = rbinomial(1, 0.7 * (1 - 0.2)) if x != 1
summarize x y
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
x | 100 .72 .4512609 0 1
y | 100 .67 .4725816 0 1
correlate x y
(obs=100)
| x y
-------------+------------------
x | 1.0000
y | 0.1781 1.0000
And a simulation:
set seed 12345
tempname sim1
tempfile mcresults
postfile `sim1' mu_x mu_y rho using `mcresults', replace
forvalues i = 1 / 100000 {
quietly {
clear
set obs 100
generate x = rbinomial(1, 0.7)
generate y = rbinomial(1, 0.7 + 0.2 * (1 - 0.7)) if x == 1
replace y = rbinomial(1, 0.7 * (1 - 0.2)) if x != 1
summarize x, meanonly
scalar mean_x = r(mean)
summarize y, meanonly
scalar mean_y = r(mean)
corr x y
scalar rho = r(rho)
post `sim1' (mean_x) (mean_y) (rho)
}
}
postclose `sim1'
use `mcresults', clear
summarize *
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
mu_x | 100,000 .7000379 .0459078 .47 .89
mu_y | 100,000 .6999094 .0456385 .49 .88
rho | 100,000 .1993097 .1042207 -.2578483 .6294388
Note that in this example I use p = 0.7
and rho = 0.2
instead.
Upvotes: 2
Reputation: 19855
Correlation is a pairwise measure, so I'm assuming that when you talk about binary (Bernoulli) values Y1,...,Yn having a correlation of rho
you're viewing them as a time series Yi: i = 1,...,n, of Bernoulli values having a common mean p
, variance p*(1-p)
, and a lag 1 correlation of rho
.
I was able to work it out using the definition of correlation and conditional probability. Given it was a bunch of tedious algebra and stackoverflow doesn't do math gracefully, I'm jumping straight to the result, expressed in pseudocode:
if Y[i] == 1:
generate Y[i+1] as Bernoulli(p + rho * (1 - p))
else:
generate Y[i+1] as Bernoulli(p * (1 - rho))
As a sanity check you can see that if rho = 0
it just generates Bernoulli(p)'s, regardless of the prior value. As you already noted in your question, Bernoulli RVs are binomials with n = 1
.
This works for all 0 <= rho, p <= 1
. For negative correlations, there are constraints on the relative magnitudes of p
and rho
so that the parameters of the Bernoullis are always between 0 and 1.
You can analytically check the conditional probabilities to confirm correctness. I don't use Stata, but I tested this pretty thoroughly in the JMP statistical software and it works like a charm.
IMPLEMENTATION (Python)
import random
def Bernoulli(p):
return 1 if random.random() <= p else 0 # yields 1 w/ prob p, 0 otherwise
N = 100000
p = 0.7
rho = 0.5
last_y = Bernoulli(p)
for _ in range(N):
if last_y == 1:
last_y = Bernoulli(p + rho * (1 - p))
else:
last_y = Bernoulli(p * (1 - rho))
print(last_y)
I ran this and redirected the results to a file, then imported the file into JMP. Analyzing it as a time series produced:
The sample mean was 0.69834, with a standard deviation of 0.4589785 [upper right of the figure]. The lag-1 estimates for autocorrelation and partial correlation are 0.5011 [bottom left and right, respectively]. These estimated values are all excellent matches to a Bernoulli(0.7) with rho = 0.5
, as specified in the demo program.
If the goal is instead to produce (X,Y) pairs with the specified correlation, revise the loop to:
for _ in range(N):
x = Bernoulli(p)
if x == 1:
y = Bernoulli(p + rho * (1 - p))
else:
y = Bernoulli(p * (1 - rho))
print(x, y)
Upvotes: 1