Loop over dataframe to simulate normal data distribution

Question

I would like to loop over a dataframe that contains the parameters for a data simulation. Ideally, I could avoid writing a for loop for this and do it in the tidyverse, but I haven't found a solution that works yet.

Consider a dataframe with parameters:

grouping1 <- c('a','a', 'a', 'b', 'b', 'b')
grouping2 <- c('A','A', 'B', 'B', 'C', 'C')
grouping3 <- c('1','2', '3', '4', '5', '6')
observations <- c(14, 14, 12, 12, 15, 15)
average <- c(334, 336, 243, 645, 233, 625)
variance <- c(2, 6, 7, 9, 2, 6)
my_data <- cbind(grouping1,grouping2,grouping3,observations,average,variance)

And a simple pipe to simulate values on the basis of those parameters:

my_generated_data <- my_data %>%
  group_by(grouping1,grouping2,grouping3) %>%
  rnorm(n=observations, mean=average, sd=variance)

But this does not work. For one thing, I get an error about an unused '.' argument, but the following doesn't work either:

my_generated_data <- my_data %>%
  group_by(grouping1,grouping2,grouping3) %>%
  rnorm(n=.$observations, mean=.$average, sd=.$variance)

Another issue is that the number of generated observations differs by the grouping level (e.g. 12, 14, or 15). This shouldn't be a major issue, but it does mean the generated dataframe will have to be long, not wide given the uneven # of rows. Thank you in advance for the help.

Martin Gal · Accepted Answer

Joans has already answered this question, but I want to add a solution using the tidyverse.

First of all, with R >= 4.0, you don't need the stringsAsFactors argument when defining data.frames. The definition of my_data is simply

my_data <- data.frame(grouping1,grouping2,grouping3,observations,average,variance)
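The switch from cbind() to data.frame() matters because cbind() on a mix of character and numeric vectors coerces everything into a character matrix, so rnorm() would receive strings rather than numbers. A minimal check, using a subset of the question's columns and the hypothetical names as_matrix and as_frame:

```r
# Subset of the question's vectors, enough to show the coercion
grouping1 <- c("a", "a", "a", "b", "b", "b")
observations <- c(14, 14, 12, 12, 15, 15)
average <- c(334, 336, 243, 645, 233, 625)

# cbind() builds a matrix, which can hold only one type
as_matrix <- cbind(grouping1, observations, average)

# data.frame() keeps one type per column
as_frame <- data.frame(grouping1, observations, average)

is.matrix(as_matrix)          # TRUE
typeof(as_matrix)             # "character": the numbers became strings
is.numeric(as_frame$average)  # TRUE
```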

Now we can use

library(dplyr)

my_generated_data <- my_data %>% 
  group_by(grouping1, grouping2, grouping3) %>% 
  mutate(sim = list(rnorm(n = observations, mean = average, sd = sqrt(variance))))

to get

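# Groups:   grouping1, grouping2, grouping3 [6]
  grouping1 grouping2 grouping3 observations average variance sim       
  <chr>     <chr>     <chr>            <dbl>   <dbl>    <dbl> <list>    
1 a         A         1                   14     334        2 <dbl [14]>
2 a         A         2                   14     336        6 <dbl [14]>
3 a         B         3                   12     243        7 <dbl [12]>
4 b         B         4                   12     645        9 <dbl [12]>
5 b         C         5                   15     233        2 <dbl [15]>
6 b         C         6                   15     625        6 <dbl [15]>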

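where the column sim contains a list of simulated values based on the observations, average and variance in the same row. Note that rnorm() expects a standard deviation, which is why sqrt(variance) is passed as sd. You could now either extract an element of this list using, for example,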

my_generated_data[[1, "sim"]]
#> [[1]]
#> [1] 333.9635 335.0959 334.2201 335.6582 335.0773 335.6701 331.9570 334.0041 332.9627 333.5582 335.6228 334.4168 330.4192
#> [14] 335.2726

or unnest it

library(tidyr) # provides unnest(), unnest_wider() and unnest_longer()

my_data %>% 
  group_by(grouping1, grouping2, grouping3) %>% 
  mutate(sim = list(rnorm(n = observations, mean = average, sd = sqrt(variance)))) %>% 
  unnest_wider(sim) # use unnest(sim) or unnest_longer(sim) for a "long" format

returning

# A tibble: 6 x 21
# Groups:   grouping1, grouping2, grouping3 [6]
  grouping1 grouping2 grouping3 observations average variance  ...1  ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 ...10
  <chr>     <chr>     <chr>            <dbl>   <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a         A         1                   14     334        2  334.  333.  337.  335.  332.  334.  335.  334.  335.  333.
2 a         A         2                   14     336        6  338.  336.  333.  334.  334.  336.  333.  339.  336.  335.
3 a         B         3                   12     243        7  243.  244.  243.  241.  241.  250.  243.  239.  243.  240.
4 b         B         4                   12     645        9  645.  645.  647.  641.  648.  639.  650.  647.  643.  641.
5 b         C         5                   15     233        2  232.  234.  235.  237.  233.  232.  235.  231.  233.  236.
6 b         C         6                   15     625        6  621.  625.  632.  625.  626.  626.  623.  620.  627.  630.
# ... with 5 more variables: ...11 <dbl>, ...12 <dbl>, ...13 <dbl>, ...14 <dbl>, ...15 <dbl>
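Since the question asks for a long result because the groups have different numbers of observations, here is a minimal sketch of the unnest(sim) variant mentioned in the comment above. It assumes dplyr and tidyr are installed; the name my_long_data and the seed are arbitrary choices for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Parameter table from the question
my_data <- data.frame(
  grouping1    = c("a", "a", "a", "b", "b", "b"),
  grouping2    = c("A", "A", "B", "B", "C", "C"),
  grouping3    = c("1", "2", "3", "4", "5", "6"),
  observations = c(14, 14, 12, 12, 15, 15),
  average      = c(334, 336, 243, 645, 233, 625),
  variance     = c(2, 6, 7, 9, 2, 6)
)

set.seed(1) # arbitrary seed, only to make the sketch reproducible

my_long_data <- my_data %>%
  group_by(grouping1, grouping2, grouping3) %>%
  # each group holds exactly one parameter row, so n, mean and sd are scalars
  mutate(sim = list(rnorm(n = observations, mean = average, sd = sqrt(variance)))) %>%
  ungroup() %>%
  unnest(sim) # one row per simulated value

nrow(my_long_data) # 82, i.e. sum(my_data$observations)
```

This relies on every combination of the grouping columns being unique. If a group contained several parameter rows, rnorm() would receive vectors for n, mean and sd, and the per-row simulation would no longer be what the code expresses.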
