MeC

Reputation: 463

loop over dataframe to simulate normal data distribution

I would like to loop over a dataframe that contains the parameters for a data simulation. Ideally, I could avoid writing a for loop for this and do it in the tidyverse, but I haven't found a solution that works yet.

Consider a dataframe with parameters:

grouping1 <- c('a','a', 'a', 'b', 'b', 'b')
grouping2 <- c('A','A', 'B', 'B', 'C', 'C')
grouping3 <- c('1','2', '3', '4', '5', '6')
observations <- c(14, 14, 12, 12, 15, 15)
average <- c(334, 336, 243, 645, 233, 625)
variance <- c(2, 6, 7, 9, 2, 6)
my_data <- cbind(grouping1,grouping2,grouping3,observations,average,variance)

And a simple pipe to simulate values on the basis of those parameters:

my_generated_data <- my_data %>%
  group_by(grouping1,grouping2,grouping3) %>%
  rnorm(n=observations, mean=average, sd=variance) 

But this does not work. For one thing, I get an error about an unused '.' argument, but the following doesn't work either:

my_generated_data <- my_data %>%
  group_by(grouping1,grouping2,grouping3) %>%
  rnorm(n=.$observations, mean=.$average, sd=.$variance) 

Another issue is that the number of generated observations differs by the grouping level (e.g. 12, 14, or 15). This shouldn't be a major issue, but it does mean the generated dataframe will have to be long, not wide given the uneven # of rows. Thank you in advance for the help.

Upvotes: 0

Views: 47

Answers (2)

Martin Gal

Reputation: 16978

Jonas has already answered this question, but I want to add a solution using the tidyverse.

First of all, since R 4.0.0 the stringsAsFactors argument defaults to FALSE, so you no longer need to pass it when creating a data.frame. The definition of my_data is simply

my_data <- data.frame(grouping1,grouping2,grouping3,observations,average,variance)

Now we can use

library(dplyr)

my_generated_data <- my_data %>% 
  group_by(grouping1, grouping2, grouping3) %>% 
  mutate(sim = list(rnorm(n = observations, mean = average, sd = sqrt(variance))))

to get

# Groups:   grouping1, grouping2, grouping3 [6]
  grouping1 grouping2 grouping3 observations average variance sim       
  <chr>     <chr>     <chr>            <dbl>   <dbl>    <dbl> <list>    
1 a         A         1                   14     334        2 <dbl [14]>
2 a         A         2                   14     336        6 <dbl [14]>
3 a         B         3                   12     243        7 <dbl [12]>
4 b         B         4                   12     645        9 <dbl [12]>
5 b         C         5                   15     233        2 <dbl [15]>
6 b         C         6                   15     625        6 <dbl [15]>

where column sim contains a list of simulated data based on the observations, average and variance in the same row. You could now either extract this list using for example

my_generated_data[[1, "sim"]]
#> [[1]]
#> [1] 333.9635 335.0959 334.2201 335.6582 335.0773 335.6701 331.9570 334.0041 332.9627 333.5582 335.6228 334.4168 330.4192
#> [14] 335.2726

or unnest it

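library(tidyr)  # provides unnest(), unnest_longer() and unnest_wider()

my_data %>% 
  group_by(grouping1, grouping2, grouping3) %>% 
  mutate(sim = list(rnorm(n = observations, mean = average, sd = sqrt(variance)))) %>% 
  # newer tidyr releases may require names_sep here because the vectors are unnamed
  unnest_wider(sim) # use unnest(sim) or unnest_longer(sim) for a "long" format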

returning

# A tibble: 6 x 21
# Groups:   grouping1, grouping2, grouping3 [6]
  grouping1 grouping2 grouping3 observations average variance  ...1  ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 ...10
  <chr>     <chr>     <chr>            <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a         A         1                   14     334        2  334.  333.  337.  335.  332.  334.  335.  334.  335.  333.
2 a         A         2                   14     336        6  338.  336.  333.  334.  334.  336.  333.  339.  336.  335.
3 a         B         3                   12     243        7  243.  244.  243.  241.  241.  250.  243.  239.  243.  240.
4 b         B         4                   12     645        9  645.  645.  647.  641.  648.  639.  650.  647.  643.  641.
5 b         C         5                   15     233        2  232.  234.  235.  237.  233.  232.  235.  231.  233.  236.
6 b         C         6                   15     625        6  621.  625.  632.  625.  626.  626.  623.  620.  627.  630.
# ... with 5 more variables: ...11 <dbl>, ...12 <dbl>, ...13 <dbl>, ...14 <dbl>, ...15 <dbl>
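Since the question asks for a long result, here is a minimal sketch of the unnest(sim) variant mentioned in the code comment above. The seed and the name my_long_data are only illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

my_data <- data.frame(
  grouping1    = c("a", "a", "a", "b", "b", "b"),
  grouping2    = c("A", "A", "B", "B", "C", "C"),
  grouping3    = c("1", "2", "3", "4", "5", "6"),
  observations = c(14, 14, 12, 12, 15, 15),
  average      = c(334, 336, 243, 645, 233, 625),
  variance     = c(2, 6, 7, 9, 2, 6)
)

# Each group is a single parameter row, so rnorm() receives scalar arguments.
my_long_data <- my_data %>%
  group_by(grouping1, grouping2, grouping3) %>%
  mutate(sim = list(rnorm(n = observations, mean = average, sd = sqrt(variance)))) %>%
  ungroup() %>%
  unnest(sim)  # one row per simulated value

nrow(my_long_data)              # 82, i.e. sum(my_data$observations)
count(my_long_data, grouping3)  # rows per parameter set
```

The long format sidesteps the uneven group sizes (12, 14 or 15 values), which in the wide format show up as NA padding in the trailing columns.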

Upvotes: 2

Jonas

Reputation: 1810

The first problem is that cbind() combines your vectors into a matrix. A matrix can hold only one type, so because at least one vector is character, every value (including the numbers) is coerced to character. To keep each vector's own type, store them in a data.frame instead:

my_data <- data.frame(grouping1 = grouping1,
                      grouping2 = grouping2,
                      grouping3 = grouping3,
                      observations = observations,
                      average = average,
                      variance = variance, 
                      stringsAsFactors = FALSE)
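To see the coercion described above, a small sketch with shortened vectors may help. The names as_matrix and as_df are made up for this illustration.

```r
grouping1    <- c("a", "a", "b")
observations <- c(14, 12, 15)

# cbind() on atomic vectors builds a matrix, and a matrix holds one type only,
# so the numbers are coerced to character.
as_matrix <- cbind(grouping1, observations)
is.matrix(as_matrix)            # TRUE
typeof(as_matrix)               # "character"

# data.frame() stores each column separately, so the types are preserved.
as_df <- data.frame(grouping1, observations)
sapply(as_df, class)
is.numeric(as_df$observations)  # TRUE
```

With a character matrix, rnorm() would receive strings instead of numbers, which is one reason the original pipe cannot work.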

Now you can loop over the rows of the dataframe and simulate your data. Since the number of simulated values depends on the observations column, as you mentioned, collect the results in a list of simulations:

simulationList <- lapply(1:NROW(my_data), function(k) {
  rnorm(n = my_data$observations[k], mean = my_data$average[k], sd = sqrt(my_data$variance[k])) 
})

You now want to add the simulations to your dataframe. Whether this is a good idea is up to you. You can do it by expanding (replicating) the rows of your dataframe to the matching length and then adding the simulations:

my_data <- my_data[rep(1:NROW(my_data), times = my_data$observations),]
my_data$simulation <- unlist(simulationList)
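The steps above can be put together in one self-contained sketch. It uses a shortened parameter table, and the names params, sims and long as well as the seed are assumptions made for this illustration. The final lines check that the expanded dataframe lines up with the simulated values.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

params <- data.frame(
  grouping3    = c("1", "2", "3"),
  observations = c(14, 12, 15),
  average      = c(334, 243, 233),
  variance     = c(2, 7, 2)
)

# One vector of draws per parameter row.
sims <- lapply(seq_len(nrow(params)), function(k) {
  rnorm(n    = params$observations[k],
        mean = params$average[k],
        sd   = sqrt(params$variance[k]))
})

# Repeat each parameter row `observations` times, then attach the draws.
long <- params[rep(seq_len(nrow(params)), times = params$observations), ]
long$simulation <- unlist(sims)

# Sanity checks: one row per draw, and group sizes match the parameters.
nrow(long) == sum(params$observations)  # TRUE
table(long$grouping3)
```

Because rep() repeats the rows in the same order in which lapply() produced the draws, unlist(sims) aligns with the expanded rows without any extra sorting.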

Upvotes: 2
