Hendrik
Hendrik

Reputation: 321

Simulate data for repeated binary measures

I can generate a binary variable y as follows:

clear
set more off
gen y =.
replace y = rbinomial(1, .5)

How can I generate n variables y_1, y_2, ..., y_n with a correlation of rho?

Upvotes: 0

Views: 536

Answers (3)

user8682794
user8682794

Reputation:

This is @pjs's solution in Stata for generating a time-series:

clear
set seed 12345
set obs 1

local p = 0.7
local rho = 0.5

generate y = runiform()
if y <= `p' replace y = 1
else replace y = 0

forvalues i = 1 / 99999 {
    set obs `= _N + 1'
    local rnd = runiform()
    if y[`i'] == 1 {
        if `rnd' <= `p' + `rho' * (1 - `p') replace y = 1 in `= `i' + 1'
        else replace y = 0 in `= `i' + 1' 
    }
    else {
        if `rnd' <= `p' * (1 - `rho') replace y = 1 in `= `i' + 1'
        else replace y = 0 in `= `i' + 1' 
    }
}

Results:

summarize y

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           y |    100,000      .70078    .4579186          0          1


generate id = _n
tsset id
corrgram y, lags(5)

                                          -1       0       1 -1       0       1
 LAG       AC       PAC      Q     Prob>Q  [Autocorrelation]  [Partial Autocor]
-------------------------------------------------------------------------------
1        0.5036   0.5036    25366  0.0000          |----              |----    
2        0.2567   0.0041    31955  0.0000          |--                |        
3        0.1273  -0.0047    33576  0.0000          |-                 |        
4        0.0572  -0.0080    33903  0.0000          |                  |        
5        0.0277   0.0032    33980  0.0000          |                  |        

Upvotes: 2

user8682794
user8682794

Reputation:

This is @pjs's solution in Stata for generating pairs of variables:

clear
set obs 100
set seed 12345

generate x = rbinomial(1, 0.7)

generate y = rbinomial(1, 0.7 + 0.2 * (1 - 0.7)) if x == 1
replace y  = rbinomial(1, 0.7 * (1 - 0.2)) if x != 1

summarize x y

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           x |        100         .72    .4512609          0          1
           y |        100         .67    .4725816          0          1

correlate x y
(obs=100)

             |        x        y
-------------+------------------
           x |   1.0000
           y |   0.1781   1.0000

And a simulation:

set seed 12345

tempname sim1
tempfile mcresults

postfile `sim1' mu_x mu_y rho using `mcresults', replace

forvalues i = 1 / 100000 {
    quietly {
        clear
        set obs 100

        generate x = rbinomial(1, 0.7)
        generate y = rbinomial(1, 0.7 + 0.2 * (1 - 0.7)) if x == 1
        replace  y = rbinomial(1, 0.7 * (1 - 0.2)) if x != 1

        summarize x, meanonly
        scalar mean_x = r(mean)

        summarize y, meanonly
        scalar mean_y = r(mean) 

        corr x y
        scalar rho = r(rho)

        post `sim1'  (mean_x) (mean_y) (rho)
    }
}

postclose `sim1'

use `mcresults', clear

summarize *

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        mu_x |    100,000    .7000379    .0459078        .47        .89
        mu_y |    100,000    .6999094    .0456385        .49        .88
         rho |    100,000    .1993097    .1042207  -.2578483   .6294388

Note that in this example I use p = 0.7 and rho = 0.2 instead.

Upvotes: 2

pjs
pjs

Reputation: 19855

Correlation is a pairwise measure, so I'm assuming that when you talk about binary (Bernoulli) values Y1,...,Yn having a correlation of rho you're viewing them as a time series Yi: i = 1,...,n, of Bernoulli values having a common mean p, variance p*(1-p), and a lag 1 correlation of rho.

I was able to work it out using the definition of correlation and conditional probability. Given it was a bunch of tedious algebra and stackoverflow doesn't do math gracefully, I'm jumping straight to the result, expressed in pseudocode:

if Y[i] == 1:
   generate Y[i+1] as Bernoulli(p + rho * (1 - p))
else:
   generate Y[i+1] as Bernoulli(p * (1 - rho))

As a sanity check you can see that if rho = 0 it just generates Bernoulli(p)'s, regardless of the prior value. As you already noted in your question, Bernoulli RVs are binomials with n = 1.

This works for all 0 <= rho, p <= 1. For negative correlations, there are constraints on the relative magnitudes of p and rho so that the parameters of the Bernoullis are always between 0 and 1.

You can analytically check the conditional probabilities to confirm correctness. I don't use Stata, but I tested this pretty thoroughly in the JMP statistical software and it works like a charm.


IMPLEMENTATION (Python)

import random

def Bernoulli(p):
    return 1 if random.random() <= p else 0 # yields 1 w/ prob p, 0 otherwise

N = 100000
p = 0.7
rho = 0.5
last_y = Bernoulli(p)
for _ in range(N):
    if last_y == 1:
        last_y = Bernoulli(p + rho * (1 - p))
    else:
        last_y =  Bernoulli(p * (1 - rho))
    print(last_y)

I ran this and redirected the results to a file, then imported the file into JMP. Analyzing it as a time series produced:

See description below

The sample mean was 0.69834, with a standard deviation of 0.4589785 [upper right of the figure]. The lag-1 estimates for autocorrelation and partial correlation are 0.5011 [bottom left and right, respectively]. These estimated values are all excellent matches to a Bernoulli(0.7) with rho = 0.5, as specified in the demo program.


If the goal is instead to produce (X,Y) pairs with the specified correlation, revise the loop to:

for _ in range(N):
    x = Bernoulli(p)
    if x == 1:
        y = Bernoulli(p + rho * (1 - p))
    else:
        y = Bernoulli(p * (1 - rho))
    print(x, y)

Upvotes: 1

Related Questions