Googme

Reputation: 914

How to simulate data properly?

Hi, I'm new to R and would like to ask a more general question. How do I simulate or create an example data set that is suitable for posting here and is also reproducible? For instance, I would like to create a numeric example that properly abstracts my data set. One condition would be to build in some correlation between my dependent and independent variables. For instance, how do I introduce some correlation between my count and my in.var1 and in.var2?

set.seed(1122)  
count<-rpois(1000,30)  
in.var1<- rnorm(1000, mean = 25, sd = 3)
in.var1<- rnorm(1000, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)

Upvotes: 0

Views: 169

Answers (3)

IRTFM

Reputation: 263481

You can introduce dependence by adding in some portion of the "information" in the two variables to the construction of the count variable:

    set.seed(1222)
    in.var1 <- rnorm(1000, mean = 25, sd = 3)
    # Corrected spelling of in.var2
    in.var2 <- rnorm(1000, mean = 12, sd = 2)
    count <- rpois(1000, 30) + 0.15*in.var1 + 0.3*in.var2
    # Avoid using `data` as an object name
    dat <- data.frame(count, in.var1, in.var2)

> library(Hmisc)  # assumed source of spearman(); it is not a base R function
> spearman(count, in.var1)
       rho 
0.06859676 
> spearman(count, in.var2)
      rho 
0.1276568 
> spearman(in.var1, in.var2)
        rho 
-0.02175273 

> summary( glm(count ~ in.var1 + in.var2, data=dat) )

Call:
glm(formula = count ~ in.var1 + in.var2, data = dat)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-16.6816   -3.6910   -0.4238    3.4435   15.5326  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.05034    1.74084  16.688  < 2e-16 ***
in.var1      0.14701    0.05613   2.619  0.00895 ** 
in.var2      0.35512    0.08228   4.316 1.74e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
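
The spearman() call above is not part of base R (it most likely comes from the Hmisc package, which is an assumption since the answer does not say). A minimal sketch of the same check using only base R, reusing the simulation from this answer; the object names rho and ct are illustrative:

```r
# Re-create the simulated data from the answer above
set.seed(1222)
in.var1 <- rnorm(1000, mean = 25, sd = 3)
in.var2 <- rnorm(1000, mean = 12, sd = 2)
count   <- rpois(1000, 30) + 0.15 * in.var1 + 0.3 * in.var2
dat     <- data.frame(count, in.var1, in.var2)

# Spearman rank correlation matrix for all three columns (base R only)
rho <- cor(dat, method = "spearman")
print(round(rho, 3))

# Test for a monotone association between count and in.var2;
# exact = FALSE requests the asymptotic p-value, which suits n = 1000
ct <- cor.test(dat$count, dat$in.var2, method = "spearman", exact = FALSE)
print(ct$estimate)
```

The off-diagonal entries of rho can be compared with the rho values printed above. Larger multipliers than 0.15 and 0.3 would give a stronger dependence.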

Upvotes: 3

John Paul

Reputation: 12704

If you want count to be a function of in.var1 and in.var2, try this. Note that count is already a function name (in packages such as plyr and dplyr), so I am renaming it to Count.

set.seed(1122)
in.var1<- rnorm(1000, mean = 4, sd = 3)
in.var2<- rnorm(1000, mean = 6, sd = 2)
Count<-rpois(1000, exp(3+ 0.5*in.var1 - 0.25*in.var2))
Data<-data.frame(Count=Count, Var1=in.var1, Var2=in.var2)

You now have a Poisson count based on in.var1 and in.var2. A Poisson regression will recover estimates close to an intercept of 3 and coefficients of 0.5 for Var1 and -0.25 for Var2:

summary(glm(Count~Var1+Var2,data=Data, family=poisson))

Call:
glm(formula = Count ~ Var1 + Var2, family = poisson, data = Data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.84702  -0.76292  -0.04463   0.67525   2.79537  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.001390   0.011782   254.7   <2e-16 ***
Var1         0.499789   0.001004   498.0   <2e-16 ***
Var2        -0.250949   0.001443  -173.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 308190.7  on 999  degrees of freedom
Residual deviance:   1063.3  on 997  degrees of freedom
AIC: 6319.2

Number of Fisher Scoring iterations: 4
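
The estimates land near 3, 0.5 and -0.25 because family = poisson uses a log link by default, so the model fits log(E[Count]) = b0 + b1*Var1 + b2*Var2, which is exactly how the rate was constructed. A minimal sketch that wraps the simulation in a helper and compares estimates with the true values; simulate_counts and its arguments are hypothetical names, not part of the answer:

```r
# Hypothetical helper: simulate Poisson counts whose log-rate is linear
# in two normal predictors. Names and defaults are illustrative only.
simulate_counts <- function(n, beta = c(3, 0.5, -0.25), seed = 1122) {
  set.seed(seed)
  Var1 <- rnorm(n, mean = 4, sd = 3)
  Var2 <- rnorm(n, mean = 6, sd = 2)
  # Log link: the Poisson rate is exp(linear predictor)
  lambda <- exp(beta[1] + beta[2] * Var1 + beta[3] * Var2)
  data.frame(Count = rpois(n, lambda), Var1 = Var1, Var2 = Var2)
}

Data <- simulate_counts(1000)
fit  <- glm(Count ~ Var1 + Var2, data = Data, family = poisson)

# Estimated vs. true coefficients side by side
comparison <- cbind(estimate = coef(fit), truth = c(3, 0.5, -0.25))
print(round(comparison, 3))
```

Changing beta lets you control the strength and sign of each effect, and changing seed gives a different but still reproducible data set.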

Upvotes: 1

Atilla Ozgur

Reputation: 14721

As I understand it, you want to add some pattern to your data.

# Basic info taken from Data Science Exploratory Analysis Course
# http://datasciencespecialization.github.io/courses/04_ExploratoryAnalysis/

set.seed(1122)  

rowNumber = 1000

count<-rpois(rowNumber,30)  
in.var1<- rnorm(rowNumber, mean = 25, sd = 3)
in.var2<- rnorm(rowNumber, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)


dataNew <- data

for (i in 1:rowNumber) {
  # flip a coin
  coinFlip <- rbinom(1, size = 1, prob = 0.5)
  # if coin is heads add a common pattern to that row
  if (coinFlip) {
    dataNew[i, "count"] <- 2 * data[i, "in.var1"] + 10 * data[i, "in.var2"]
  }
}

Basically, I am imposing the pattern count = 2 * in.var1 + 10 * in.var2 on randomly selected rows, namely those where coinFlip comes up 1. Of course, you should vectorize the loop if you have more rows.
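
A minimal sketch of a vectorized version of the loop above, assuming a data frame is acceptable in place of the cbind matrix. The names dat, datNew and heads are illustrative, and the coin flips are not guaranteed to match the loop draw for draw:

```r
set.seed(1122)
rowNumber <- 1000

count   <- rpois(rowNumber, 30)
in.var1 <- rnorm(rowNumber, mean = 25, sd = 3)
in.var2 <- rnorm(rowNumber, mean = 12, sd = 2)
dat     <- data.frame(count, in.var1, in.var2)

# One coin flip per row, drawn in a single call
heads <- rbinom(rowNumber, size = 1, prob = 0.5) == 1

# Overwrite count with the pattern only on the rows that came up heads
datNew <- dat
datNew$count[heads] <- 2 * dat$in.var1[heads] + 10 * dat$in.var2[heads]

# Roughly half of the rows should now follow the pattern
print(mean(heads))
```

Logical indexing replaces both the for loop and the if test, so the cost no longer grows with one R-level iteration per row.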

Upvotes: 0
