Parseltongue

Reputation: 11657

Create simulated dataframe in dplyr from another dataframe

Let's say I have the following summary of pilot data:

pilot_data = read.table(text = "pairing male dv_mean dv_sd
AA  0   1.4377551   11.99576    
AA  1   0.1745918   10.03553    
AB  0   12.6574286  17.76540    
AB  1   9.5337037   13.92486    
BA  0   8.8971111   16.49538    
BA  1   8.8706557   17.13532    
BB  0   1.6339286   12.72830    
BB  1   -0.1433333  13.68828", header = T)

I'd like to create a simulated dataset in dplyr for each pairing, male combination that has the same mean and standard deviation as that cell. So, for example, if I wanted to have 300 rows for each pairing, male combination, I'd do something like:

tester = pilot_data %>% group_by(pairing, male) %>%
  mutate(simulated_data = rnorm(mean = dv_mean, sd = dv_sd, n = 300))

Except this obviously won't work because of a recycling error. I can use a for loop to do this and append a dataset to itself over and over again, but I'm trying to learn how to do this in a dplyr chain.

What's the best way to achieve this?

Upvotes: 1

Views: 71

Answers (2)

ThomasIsCoding

Reputation: 101343

Here is a data.table option

> library(data.table)
> setDT(pilot_data)[, .(simulated_data = rnorm(300, dv_mean, dv_sd)), .(pairing, male)]
      pairing male simulated_data
   1:      AA    0     -11.068416
   2:      AA    0      -4.925878
   3:      AA    0     -11.044629
   4:      AA    0      -7.946300
   5:      AA    0       3.352702
  ---
2396:      BB    1       8.966713
2397:      BB    1     -14.925273
2398:      BB    1     -11.957720
2399:      BB    1      17.335359
2400:      BB    1      17.824735

Upvotes: 1

akrun

Reputation: 887118

We can use summarise instead of mutate: summarise can return more than one row per group, whereas mutate must return either a single value or exactly one value per row of the group.

library(dplyr)
pilot_data %>%
  group_by(pairing, male) %>%
  summarise(simulated_data = rnorm(n = 300, mean = dv_mean, sd = dv_sd),
            .groups = 'drop')

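NOTE: This works as intended because each pairing/male group contains exactly one row, so dv_mean and dv_sd are single values within each group. With several rows per group, rnorm would recycle the vectors of means and sds across the 300 draws, which is probably not what you want.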
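On dplyr 1.1.0 and later, returning more than one row per group from summarise is deprecated and emits a warning; reframe is the function intended for this case. A minimal sketch, assuming that dplyr version is installed (the seed and the name `simulated` are arbitrary choices for illustration):

```r
library(dplyr)  # reframe() requires dplyr >= 1.1.0

pilot_data <- read.table(text = "pairing male dv_mean dv_sd
AA  0   1.4377551   11.99576
AA  1   0.1745918   10.03553
AB  0   12.6574286  17.76540
AB  1   9.5337037   13.92486
BA  0   8.8971111   16.49538
BA  1   8.8706557   17.13532
BB  0   1.6339286   12.72830
BB  1   -0.1433333  13.68828", header = TRUE)

set.seed(42)  # arbitrary seed, only to make the draws reproducible

# reframe() may return any number of rows per group and
# always returns an ungrouped data frame
simulated <- pilot_data %>%
  group_by(pairing, male) %>%
  reframe(simulated_data = rnorm(n = 300, mean = dv_mean, sd = dv_sd))
```

The result has 300 rows for each of the 8 pairing/male combinations, so 2400 rows in total, with the columns pairing, male and simulated_data.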


Another option is to use rowwise, store each row's draws in a list column, and then unnest it. This also works when the same pairing/male combination appears in more than one row.

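library(tidyr)
pilot_data %>%
  rowwise() %>%
  # one vector of 300 draws per row, wrapped in a list column
  mutate(simulated_data = list(rnorm(n = 300, mean = dv_mean, sd = dv_sd))) %>%
  ungroup() %>%  # drop the rowwise grouping before expanding
  unnest(simulated_data)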
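Note that rnorm only matches the pilot mean and standard deviation approximately, since each sample has its own sampling error. If the simulated cells should reproduce dv_mean and dv_sd exactly, one possible approach (a sketch, not part of the original answer) is to standardise each sample with scale and then shift and rescale it. The names `exact` and `check` below are illustrative only:

```r
library(dplyr)
library(tidyr)

pilot_data <- read.table(text = "pairing male dv_mean dv_sd
AA  0   1.4377551   11.99576
AA  1   0.1745918   10.03553
AB  0   12.6574286  17.76540
AB  1   9.5337037   13.92486
BA  0   8.8971111   16.49538
BA  1   8.8706557   17.13532
BB  0   1.6339286   12.72830
BB  1   -0.1433333  13.68828", header = TRUE)

set.seed(1)  # arbitrary seed, only to make the draws reproducible

exact <- pilot_data %>%
  rowwise() %>%
  # scale() forces the sample to mean 0 and sd 1; then shift and rescale
  mutate(simulated_data = list(
    dv_mean + dv_sd * as.numeric(scale(rnorm(300)))
  )) %>%
  ungroup() %>%
  unnest(simulated_data)

# Compare the simulated moments with the pilot values for each cell
check <- exact %>%
  group_by(pairing, male) %>%
  summarise(sim_mean    = mean(simulated_data),
            sim_sd      = sd(simulated_data),
            target_mean = first(dv_mean),
            target_sd   = first(dv_sd),
            .groups = "drop")
```

In `check`, sim_mean and sim_sd should agree with target_mean and target_sd up to floating-point error. The trade-off is that the rescaled values are no longer independent draws from the normal distribution, so plain rnorm is the better choice when the goal is to mimic sampling variability.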

Upvotes: 2
