Reputation: 914
Hi, I'm new to R and would like to ask a more general question. How do I simulate or create an example data set that is suitable to be posted here and is also reproducible? For instance, I would like to create a numeric example that represents my data set properly. One condition would be to build in some correlation between my dependent and independent variables.
For instance, how do I introduce some correlation between my count and my in.var1 and in.var2?
set.seed(1122)
count<-rpois(1000,30)
in.var1<- rnorm(1000, mean = 25, sd = 3)
in.var1<- rnorm(1000, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)
Upvotes: 0
Views: 169
Reputation: 263481
You can introduce dependence by adding in some portion of the "information" in the two variables to the construction of the count variable:
set.seed(1222)
in.var1<- rnorm(1000, mean = 25, sd = 3)
#Corrected spelling of in.var2
in.var2<- rnorm(1000, mean = 12, sd = 2)
count<-rpois(1000,30) + 0.15*in.var1 + 0.3*in.var2
# Avoid using 'data' as an object name
dat<-data.frame(count,in.var1,in.var2)
> spearman(count, in.var1)
rho
0.06859676
> spearman(count, in.var2)
rho
0.1276568
> spearman(in.var1, in.var2)
rho
-0.02175273
> summary( glm(count ~ in.var1 + in.var2, data=dat) )
Call:
glm(formula = count ~ in.var1 + in.var2, data = dat)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-16.6816   -3.6910   -0.4238    3.4435   15.5326
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.05034    1.74084  16.688  < 2e-16 ***
in.var1      0.14701    0.05613   2.619  0.00895 **
in.var2      0.35512    0.08228   4.316 1.74e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
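The spearman() calls above are not part of base R and presumably come from an add-on package (the answer does not say which). As a minimal sketch, the same kind of rank correlation can be obtained with base R's cor(). The stronger multipliers 1.5 and 3 below are arbitrary illustrative values, not something from the answer; they only show that larger weights give a stronger dependence.

```r
# Re-create the simulated data from the answer above
set.seed(1222)
in.var1 <- rnorm(1000, mean = 25, sd = 3)
in.var2 <- rnorm(1000, mean = 12, sd = 2)
count <- rpois(1000, 30) + 0.15 * in.var1 + 0.3 * in.var2

# Base-R Spearman rank correlation; no extra package required
rho1 <- cor(count, in.var1, method = "spearman")
rho2 <- cor(count, in.var2, method = "spearman")

# Assumption for illustration: ten times larger weights on the predictors
count.strong <- rpois(1000, 30) + 1.5 * in.var1 + 3 * in.var2
rho1.strong <- cor(count.strong, in.var1, method = "spearman")
rho2.strong <- cor(count.strong, in.var2, method = "spearman")

print(c(rho1 = rho1, rho1.strong = rho1.strong,
        rho2 = rho2, rho2.strong = rho2.strong))
```

Note that adding the non-integer terms means count is no longer integer-valued, which is why the glm() call above uses the default gaussian family rather than a Poisson one.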
Upvotes: 3
Reputation: 12704
If you want count to be a function of in.var1 and in.var2, try this. Note that count is already a function name in some packages (for example plyr and dplyr), so I am changing it to Count.
set.seed(1122)
in.var1<- rnorm(1000, mean = 4, sd = 3)
in.var2<- rnorm(1000, mean = 6, sd = 2)
Count<-rpois(1000, exp(3+ 0.5*in.var1 - 0.25*in.var2))
Data<-data.frame(Count=Count, Var1=in.var1, Var2=in.var2)
You now have a Poisson count based on in.var1 and in.var2. A Poisson regression will recover an intercept close to 3 and coefficients close to 0.5 for Var1 and -0.25 for Var2:
summary(glm(Count~Var1+Var2,data=Data, family=poisson))
Call:
glm(formula = Count ~ Var1 + Var2, family = poisson, data = Data)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.84702  -0.76292  -0.04463   0.67525   2.79537
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.001390   0.011782   254.7   <2e-16 ***
Var1         0.499789   0.001004   498.0   <2e-16 ***
Var2        -0.250949   0.001443  -173.9   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 308190.7 on 999 degrees of freedom
Residual deviance: 1063.3 on 997 degrees of freedom
AIC: 6319.2
Number of Fisher Scoring iterations: 4
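To compare the fitted coefficients with the values used in the simulation, one option is to put them side by side with Wald confidence intervals. This is only a sketch: the truth vector and the comparison table are names introduced here for illustration, and confint.default() is used as a simple choice for the intervals.

```r
# Re-create the simulated data from the answer above
set.seed(1122)
in.var1 <- rnorm(1000, mean = 4, sd = 3)
in.var2 <- rnorm(1000, mean = 6, sd = 2)
Count <- rpois(1000, exp(3 + 0.5 * in.var1 - 0.25 * in.var2))
Data <- data.frame(Count = Count, Var1 = in.var1, Var2 = in.var2)

fit <- glm(Count ~ Var1 + Var2, data = Data, family = poisson)

# Parameters used to generate the data, on the log (link) scale
truth <- c("(Intercept)" = 3, Var1 = 0.5, Var2 = -0.25)

# True values, estimates, and 95% Wald intervals in one table
comparison <- cbind(truth = truth,
                    estimate = coef(fit),
                    confint.default(fit))
print(round(comparison, 4))
```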
Upvotes: 1
Reputation: 14721
As I understand it, you want to add some pattern to your data.
# Basic info taken from Data Science Exploratory Analysis Course
# http://datasciencespecialization.github.io/courses/04_ExploratoryAnalysis/
set.seed(1122)
rowNumber = 1000
count<-rpois(rowNumber,30)
in.var1<- rnorm(rowNumber, mean = 25, sd = 3)
in.var2<- rnorm(rowNumber, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)
dataNew <- data
for (i in 1:rowNumber) {
# flip a coin
coinFlip <- rbinom(1, size = 1, prob = 0.5)
# if coin is heads add a common pattern to that row
if (coinFlip) {
dataNew[i,"count"] <- 2 * data[i,"in.var1"] + 10* data[i,"in.var2"]
}
}
Basically, I am imposing the pattern count = 2 * in.var1 + 10 * in.var2 on a random subset of rows, chosen by the coinFlip variable. Of course, you should vectorize this for larger data sets.
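A vectorized version of the loop might look like the sketch below. It draws all coin flips in one rbinom() call and uses logical indexing in place of the for loop. The names dat, datNew and heads are introduced here for illustration, and a data frame is assumed instead of the cbind() matrix.

```r
set.seed(1122)
rowNumber <- 1000
count <- rpois(rowNumber, 30)
in.var1 <- rnorm(rowNumber, mean = 25, sd = 3)
in.var2 <- rnorm(rowNumber, mean = 12, sd = 2)
dat <- data.frame(count, in.var1, in.var2)

# One coin flip per row, drawn in a single call (TRUE = heads)
heads <- rbinom(rowNumber, size = 1, prob = 0.5) == 1

# Overwrite count with the pattern only in the rows where the coin came up heads
datNew <- dat
datNew$count[heads] <- 2 * dat$in.var1[heads] + 10 * dat$in.var2[heads]
```

Roughly half of the rows then follow the pattern exactly, while the rest keep their original Poisson draws.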
Upvotes: 0