This is my first attempt at simulating data. We'd like to simulate a dataset and have chosen the simstudy package, using the following code:
library(simstudy)

# Covariates: each is defined with the mean observed in the original data (df)
def <- defData(varname = "median_household_income",
               formula = mean(df$median_household_income))
def <- defData(def, varname = "share_unemployed_seasonal",
               formula = mean(df$share_unemployed_seasonal))
def <- defData(def, varname = "share_population_in_metro_areas",
               formula = mean(df$share_population_in_metro_areas))
def <- defData(def, varname = "share_population_with_high_school_degree",
               formula = mean(df$share_population_with_high_school_degree))
def <- defData(def, varname = "share_non_citizen",
               formula = mean(df$share_non_citizen))
def <- defData(def, varname = "share_white_poverty",
               formula = mean(df$share_white_poverty))
def <- defData(def, varname = "gini_index",
               formula = mean(df$gini_index))
def <- defData(def, varname = "share_non_white",
               formula = mean(df$share_non_white))
def <- defData(def, varname = "share_voters_voted_trump",
               formula = mean(df$share_voters_voted_trump))

# Outcome: a linear combination of the covariates defined above
def <- defData(def, varname = "avg_hatecrimes_per_100k_fbi",
               formula = ".0001*median_household_income + 44*share_unemployed_seasonal +
  -2.8*share_population_in_metro_areas +
  24*share_population_with_high_school_degree + 22*share_non_citizen +
  3.2*share_white_poverty + 55*gini_index + -4*share_non_white +
  -2.6*share_voters_voted_trump")

# Generate simulated data
df_sim <- genData(10000, def)
The output looks like this:
head(df_sim)
   id median_household_income share_unemployed_seasonal share_population_in_metro_areas
1:  1                55223.61                0.04956863                       0.7501961
2:  2                55223.61                0.04956863                       0.7501961
3:  3                55223.61                0.04956863                       0.7501961
4:  4                55223.61                0.04956863                       0.7501961
5:  5                55223.61                0.04956863                       0.7501961
6:  6                55223.61                0.04956863                       0.7501961
Why are all the generated values identical? My understanding is that the variables are generated according to a normal distribution by default. Any help with this is appreciated!
Upvotes: 0
Views: 131
You are referring to the simstudy package. If you check the documentation for the defData function, you will find that it has a variance parameter, which defaults to zero. With a variance of zero, every draw from the normal distribution is exactly equal to its mean, which is why all of your rows are identical. If you want non-identical observations, you need to set variance to a number larger than 0.
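To see why a zero variance gives identical rows, here is a minimal sketch in base R. It uses rnorm rather than simstudy itself, on the assumption that a normal variable in simstudy behaves like rnorm with sd = sqrt(variance); the mean is rounded from the output in the question.

```r
set.seed(1)  # make the random draws reproducible

m <- 55223.61  # mean of median_household_income, as shown in the question's output

# Zero spread: every draw collapses to the mean
no_spread <- rnorm(5, mean = m, sd = 0)
print(no_spread)

# Positive spread: the draws now differ from one another
with_spread <- rnorm(5, mean = m, sd = sqrt(1))
print(with_spread)
```

The first vector contains five copies of the mean, which matches the identical rows seen in head(df_sim).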
The default arguments of the defData function:
defData(dtDefs = NULL, varname, formula, variance = 0,
dist = "normal", link = "identity", id = "id")
So you might want to run a command like:
def <- defData(varname = "median_household_income",
               formula = mean(df$median_household_income),
               variance = 1)
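A variance of 1 is very small next to an income of roughly 55,000, so one option is to take the variance from the observed data as well. The sketch below assumes that is what you want. The small df here is a hypothetical stand-in with made-up values, because the original data is not shown in the question.

```r
library(simstudy)

set.seed(42)  # make the simulation reproducible

# Hypothetical stand-in for the original df (values are made up for illustration)
df <- data.frame(
  median_household_income = c(48000, 52000, 55000, 61000, 60000),
  gini_index = c(0.43, 0.45, 0.46, 0.44, 0.47)
)

# Use the sample mean and the sample variance of each observed column
def <- defData(varname = "median_household_income",
               formula = mean(df$median_household_income),
               variance = var(df$median_household_income))
def <- defData(def, varname = "gini_index",
               formula = mean(df$gini_index),
               variance = var(df$gini_index))

df_sim <- genData(1000, def)

# The simulated columns should now vary from row to row
head(df_sim)
sd(df_sim$median_household_income)
```

The same pattern applies to the remaining covariates. Note that variables defined this way are drawn independently of each other, so any correlation between columns in the original data is not reproduced.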
Upvotes: 1