Alissa

Reputation: 99

simstudy: genData producing identical values

This is my first attempt at simulating data - we'd like to simulate a dataset and have elected to use simstudy using the following code:

def <- defData(varname='median_household_income',formula=mean(
               df$median_household_income))
def <- defData(def, varname='share_unemployed_seasonal',formula=mean(
               df$share_unemployed_seasonal))
def <- defData(def, varname='share_population_in_metro_areas',
               formula=mean(df$share_population_in_metro_areas))
def <- defData(def, varname='share_population_with_high_school_degree',
               formula=mean(df$share_population_with_high_school_degree))
def <- defData(def, varname='share_non_citizen',
               formula=mean(df$share_non_citizen))
def <- defData(def, varname='share_white_poverty',
               formula=mean(df$share_white_poverty))
def <- defData(def, varname='gini_index',formula=mean(df$gini_index))
def <- defData(def, varname='share_non_white',formula=mean(df$share_non_white))
def <- defData(def, varname='share_voters_voted_trump',
               formula=mean(df$share_voters_voted_trump))
#outcome
def <- defData(def, varname='avg_hatecrimes_per_100k_fbi',formula=
               ".0001*median_household_income + 44*share_unemployed_seasonal + 
               -2.8*share_population_in_metro_areas +
               24*share_population_with_high_school_degree + 22*share_non_citizen + 
               3.2*share_white_poverty + 55*gini_index + -4*share_non_white + 
               -2.6*share_voters_voted_trump")

#generate simulated data
df_sim <- genData(10000,def)

The output looks like this:

 head(df_sim)
 id median_household_income share_unemployed_seasonal share_population_in_metro_areas
1:  1                55223.61                0.04956863                       0.7501961
2:  2                55223.61                0.04956863                       0.7501961
3:  3                55223.61                0.04956863                       0.7501961
4:  4                55223.61                0.04956863                       0.7501961
5:  5                55223.61                0.04956863                       0.7501961
6:  6                55223.61                0.04956863                       0.7501961

Why are all the generated values identical? My understanding is that the variables are generated according to a normal distribution by default. Any help with this is appreciated!

Upvotes: 0

Views: 131

Answers (1)

ira

Reputation: 2644

You are using the simstudy package. If you check the documentation for its defData function, you will find that it has a variance parameter, which defaults to zero. With zero variance, every draw from the normal distribution equals the mean given in formula, which is why all your rows are identical. If you want non-identical observations, you need to set variance to a number larger than 0.

The defaults in the defData function signature:

defData(dtDefs = NULL, varname, formula, variance = 0,
  dist = "normal", link = "identity", id = "id")

So you might want to run a command like

def <- defData(varname='median_household_income',
               formula=mean(df$median_household_income),
               variance = 1)
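
To make the effect of variance concrete, here is a minimal sketch comparing the default with a non-zero value. The small df below is a hypothetical stand-in, since the original data frame is not shown in the question, and taking the variance from var() of the observed column is just one plausible choice, not something the package requires. Note that for the normal distribution this argument is a variance rather than a standard deviation, so a value of 1 would give almost no visible spread on an income scale of around 55,000.

```r
library(simstudy)

set.seed(1)  # only so that repeated runs give the same draws

# Hypothetical stand-in for the asker's df (not shown in the question)
df <- data.frame(median_household_income = c(42000, 51000, 55000, 61000, 67000))

# variance left at its default of 0: every simulated value equals the mean
def0 <- defData(varname = "median_household_income",
                formula = mean(df$median_household_income))
sim0 <- genData(1000, def0)

# variance taken from the observed data, so the spread is on the original scale
def1 <- defData(varname = "median_household_income",
                formula = mean(df$median_household_income),
                variance = var(df$median_household_income))
sim1 <- genData(1000, def1)

length(unique(sim0$median_household_income))  # number of distinct values with variance = 0
sd(sim1$median_household_income)              # compare with sd(df$median_household_income)
```

The same variance argument would need to be set on each of the other defData calls, including the outcome, if those columns should vary as well.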

Upvotes: 1
