Reputation: 463
I would like to loop over a dataframe that contains the parameters for a data simulation. Ideally, I could avoid writing a for loop for this and do it in the tidyverse, but I haven't found a solution that works yet.
Consider a dataframe with parameters:
grouping1 <- c('a','a', 'a', 'b', 'b', 'b')
grouping2 <- c('A','A', 'B', 'B', 'C', 'C')
grouping3 <- c('1','2', '3', '4', '5', '6')
observations <- c(14, 14, 12, 12, 15, 15)
average <- c(334, 336, 243, 645, 233, 625)
variance <- c(2, 6, 7, 9, 2, 6)
my_data <- cbind(grouping1,grouping2,grouping3,observations,average,variance)
And a simple pipe to simulate values on the basis of those parameters:
my_generated_data <- my_data %>%
group_by(grouping1,grouping2,grouping3) %>%
rnorm(n=observations, mean=average, sd=variance)
But this does not work. For one thing, I get an error about an unused '.' argument, but the following doesn't work either:
my_generated_data <- my_data %>%
group_by(grouping1,grouping2,grouping3) %>%
rnorm(n=.$observations, mean=.$average, sd=.$variance)
Another issue is that the number of generated observations differs by the grouping level (e.g. 12, 14, or 15). This shouldn't be a major issue, but it does mean the generated dataframe will have to be long, not wide given the uneven # of rows. Thank you in advance for the help.
Upvotes: 0
Views: 47
Reputation: 16978
Joans already answered this question, but I want to add a solution using the tidyverse.
First of all, with R >= 4.0, you don't need the stringsAsFactors argument when defining data.frames. The definition of my_data is simply
my_data <- data.frame(grouping1,grouping2,grouping3,observations,average,variance)
Now we can use
library(dplyr)
my_generated_data <- my_data %>%
  group_by(grouping1, grouping2, grouping3) %>%
  mutate(sim = list(rnorm(n = observations, mean = average, sd = sqrt(variance))))
to get
# Groups: grouping1, grouping2, grouping3 [6]
grouping1 grouping2 grouping3 observations average variance sim
<chr> <chr> <chr> <dbl> <dbl> <dbl> <list>
1 a A 1 14 334 2 <dbl [14]>
2 a A 2 14 336 6 <dbl [14]>
3 a B 3 12 243 7 <dbl [12]>
4 b B 4 12 645 9 <dbl [12]>
5 b C 5 15 233 2 <dbl [15]>
6 b C 6 15 625 6 <dbl [15]>
where the column sim contains a list of simulated data based on the observations, average and variance in the same row. You could now either extract this list using, for example,
my_generated_data[[1, "sim"]]
#> [[1]]
#> [1] 333.9635 335.0959 334.2201 335.6582 335.0773 335.6701 331.9570 334.0041 332.9627 333.5582 335.6228 334.4168 330.4192
#> [14] 335.2726
or unnest it
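library(tidyr)  # provides unnest(), unnest_longer() and unnest_wider()
my_data %>%
  group_by(grouping1, grouping2, grouping3) %>%
  mutate(sim = list(rnorm(n = observations, mean = average, sd = sqrt(variance)))) %>%
  # newer tidyr versions may require a names_sep argument here, since the simulated vectors are unnamed
  unnest_wider(sim) # use unnest(sim) or unnest_longer(sim) for a "long" format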
returning
# A tibble: 6 x 21
# Groups: grouping1, grouping2, grouping3 [6]
grouping1 grouping2 grouping3 observations average variance ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a A 1 14 334 2 334. 333. 337. 335. 332. 334. 335. 334. 335. 333.
2 a A 2 14 336 6 338. 336. 333. 334. 334. 336. 333. 339. 336. 335.
3 a B 3 12 243 7 243. 244. 243. 241. 241. 250. 243. 239. 243. 240.
4 b B 4 12 645 9 645. 645. 647. 641. 648. 639. 650. 647. 643. 641.
5 b C 5 15 233 2 232. 234. 235. 237. 233. 232. 235. 231. 233. 236.
6 b C 6 15 625 6 621. 625. 632. 625. 626. 626. 623. 620. 627. 630.
# ... with 5 more variables: ...11 <dbl>, ...12 <dbl>, ...13 <dbl>, ...14 <dbl>, ...15 <dbl>
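Since the number of simulated values differs per row, the long format is probably the more natural fit. Here is a minimal, self-contained sketch of that variant. It assumes dplyr and tidyr are installed; the name my_long_data, the use of rowwise() instead of group_by() and the seed are my own choices for illustration, not something the question requires.

```r
library(dplyr)
library(tidyr)

my_data <- data.frame(
  grouping1 = c('a', 'a', 'a', 'b', 'b', 'b'),
  grouping2 = c('A', 'A', 'B', 'B', 'C', 'C'),
  grouping3 = c('1', '2', '3', '4', '5', '6'),
  observations = c(14, 14, 12, 12, 15, 15),
  average = c(334, 336, 243, 645, 233, 625),
  variance = c(2, 6, 7, 9, 2, 6)
)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

my_long_data <- my_data %>%
  rowwise() %>%    # treat every row as its own group, whatever the grouping columns are
  mutate(sim = list(rnorm(n = observations, mean = average, sd = sqrt(variance)))) %>%
  ungroup() %>%
  unnest(sim)      # one row per simulated value

nrow(my_long_data)  # equals sum(my_data$observations), i.e. 82
```

Using rowwise() means the simulation still runs once per parameter row even if the grouping columns were not unique for each row.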
Upvotes: 2
Reputation: 1810
The first problem is that cbind combines your vectors into a matrix. A matrix can hold only one type, so the resulting matrix is of type character, since at least one of the vectors is character. The structure you need in order to store the vectors while retaining their types is a data.frame, like
my_data <- data.frame(grouping1 = grouping1,
                      grouping2 = grouping2,
                      grouping3 = grouping3,
                      observations = observations,
                      average = average,
                      variance = variance,
                      stringsAsFactors = FALSE)
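If you want to see the coercion for yourself, a quick check along these lines should do it. This is base R only; the names as_matrix and as_frame are just illustrative, and only two of the vectors are used to keep it short.

```r
grouping1 <- c('a', 'a', 'a', 'b', 'b', 'b')
observations <- c(14, 14, 12, 12, 15, 15)

# cbind() on plain vectors builds a matrix, which can hold only one type
as_matrix <- cbind(grouping1, observations)
typeof(as_matrix)         # "character": the numbers were coerced to strings

# data.frame() keeps each column's own type
as_frame <- data.frame(grouping1, observations, stringsAsFactors = FALSE)
sapply(as_frame, class)   # grouping1 is character, observations is numeric
```

With the matrix version, rnorm would receive strings such as "14" instead of numbers, which is one reason the original pipe cannot work.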
Now you can loop over the rows of the dataframe and simulate your data. Since the length of each simulation depends on the observations column, as you mentioned, create a list of simulations:
simulationList <- lapply(1:NROW(my_data), function(k) {
  rnorm(n = my_data$observations[k], mean = my_data$average[k], sd = sqrt(my_data$variance[k]))
})
You now want to add the simulations to your dataframe. Whether this is a good idea is up to you, but you could achieve it by expanding (replicating) the rows of your dataframe to the required length and then adding the simulations, like
my_data <- my_data[rep(1:NROW(my_data), times = my_data$observations),]
my_data$simulation <- unlist(simulationList)
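To see why the expanded rows line up with the simulated values, here is a small self-contained sketch on a toy parameter table. The three-row data and the name expanded are made up for illustration; the steps are the same as above.

```r
my_data <- data.frame(
  grouping3 = c('1', '2', '3'),
  observations = c(2, 3, 1),
  average = c(10, 20, 30),
  variance = c(1, 4, 9),
  stringsAsFactors = FALSE
)

simulationList <- lapply(seq_len(nrow(my_data)), function(k) {
  rnorm(n = my_data$observations[k], mean = my_data$average[k], sd = sqrt(my_data$variance[k]))
})

# rep(..., times = ) repeats row k exactly observations[k] times, in order,
# which matches the order in which unlist() concatenates the simulations
expanded <- my_data[rep(seq_len(nrow(my_data)), times = my_data$observations), ]
expanded$simulation <- unlist(simulationList)

expanded$grouping3   # "1" "1" "2" "2" "2" "3"
```

Each parameter row appears once per simulated value, so the result is already in long format.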
Upvotes: 2