Reputation: 101
I have the following tibble:
# A tibble: 1,100 x 3
income minimum maximum
<dbl> <dbl> <dbl>
1 NA NA NA
2 0 0 25
3 0 0 25
4 NA NA NA
5 4 100 200
For each row, I want to simulate a value between the minimum and maximum, assuming the values follow a uniform distribution on that interval.
Any idea how to do this? The simulated values should appear on the right side under the variable income.
Upvotes: 3
Views: 193
Reputation: 7818
This is probably what you're looking for:
df$salary <- runif(nrow(df)) * (df$upperboundary - df$lowerboundary) + df$lowerboundary
The default interval of runif is 0-1. This operation rescales it to your boundaries. It is the fastest solution.
With dplyr, if your code is tidyverse oriented:
library(dplyr)
df %>% mutate(salary = runif(n()) * (upperboundary - lowerboundary) + lowerboundary)
However, it is also possible to define the boundaries directly:
df$salary <- runif(nrow(df), df$lowerboundary, df$upperboundary)
If you had no NAs, this would be the optimal solution; with NA boundaries, runif returns NaN for those rows and emits a warning. In any case, it is the most readable. [Thanks to @user20650 for your help!]
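runif is vectorised over its min and max arguments, which is why the direct call above works row by row. If you would rather get NA than NaN (and no warning) for rows with missing boundaries, one option is to draw only for the complete rows. This is a minimal sketch, not part of the original answer; the helper name ok and the small data frame are made up for illustration:

```r
# Small example with missing boundaries (illustrative values).
df <- data.frame(
  lowerboundary = c(NA, 0, 0, NA, 425),
  upperboundary = c(NA, 50, 50, NA, 600)
)

# TRUE for rows where both boundaries are present.
ok <- !is.na(df$lowerboundary) & !is.na(df$upperboundary)

# Start with NA everywhere, then draw only for the complete rows.
df$salary <- NA_real_
df$salary[ok] <- runif(sum(ok), df$lowerboundary[ok], df$upperboundary[ok])

df
```

Rows with missing boundaries keep NA, and runif is never called with NA arguments.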
Additional details.
How does this work?
runif(nrow(df)) * (df$upperboundary - df$lowerboundary) + df$lowerboundary
Let's look at a single draw and define a min and a max manually.
By default, runif(1) is equivalent to:
runif(1, min = 0, max = 1)
Therefore, it returns a random number between 0 and 1, drawn from a uniform distribution.
To return a random number between two different limits, say min = 10 and max = 20, you can do it this way:
runif(1, min = 10, max = 20)
or
min <- 10
max <- 20
runif(1, min = 0, max = 1) * (max - min) + min
If the output of runif is 0:
0 * (20 - 10) + 10
==> 10
If the output of runif is 1:
1 * (20 - 10) + 10
==> 20 - 10 + 10
==> 20
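The same rescaling can be checked empirically. The sketch below is an illustration added here, with the names lo, hi, direct and scaled chosen for the example. It resets the seed so both calls consume the same underlying random numbers, then compares the direct call with the manual rescaling:

```r
lo <- 10
hi <- 20

# Direct call with explicit limits.
set.seed(42)
direct <- runif(5, min = lo, max = hi)

# Same seed, default 0-1 interval, rescaled by hand.
set.seed(42)
scaled <- runif(5) * (hi - lo) + lo

all.equal(direct, scaled)  # compares the two vectors up to numerical tolerance
range(scaled)              # smallest and largest simulated value
```

Both vectors should agree, and every value should fall between lo and hi.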
Here is also an alternative with dplyr to the solution with apply:
library(dplyr)
df %>%
rowwise() %>%
mutate(salary = runif(1, lowerboundary, upperboundary)) %>%
ungroup()
Here's a speed comparison. The "maths" one is the fastest:
microbenchmark::microbenchmark(
apply = apply(df[-1],1, function(x) runif(1, x[1], x[2])),
maths = runif(nrow(df)) * (df$upperboundary - df$lowerboundary) + df$lowerboundary,
maths2 = runif(nrow(df), df$lowerboundary, df$upperboundary),
dplyr = df %>% rowwise() %>% mutate(runif = runif(1, lowerboundary, upperboundary)) %>% ungroup()
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> apply 907.1 955.90 1175.188 1023.70 1280.90 4455.0 100
#> maths 16.8 26.05 32.651 31.25 38.65 75.0 100
#> maths2 117.8 128.00 156.533 136.60 175.15 336.7 100
#> dplyr 1424.2 1496.60 1821.068 1661.15 1989.20 3952.7 100
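To see how the approaches behave on more rows without installing microbenchmark, a rough check with base R's system.time is possible. This is only a sketch: the data frame big, its size n and its boundary values are made up for illustration, and the timings will vary by machine.

```r
set.seed(1)
n <- 1e5  # hypothetical number of rows

# Made-up boundaries with upperboundary always above lowerboundary.
big <- data.frame(lowerboundary = runif(n, 0, 100))
big$upperboundary <- big$lowerboundary + runif(n, 1, 100)

# Vectorised rescaling of a 0-1 draw.
system.time(
  maths <- runif(n) * (big$upperboundary - big$lowerboundary) + big$lowerboundary
)

# Vectorised call with the boundaries passed directly.
system.time(
  maths2 <- runif(n, big$lowerboundary, big$upperboundary)
)

# Row-by-row call through apply.
system.time(
  by_row <- apply(big, 1, function(x) runif(1, x[1], x[2]))
)
```

The two vectorised calls make a single call to runif, while apply calls it once per row, which is where the gap in the table above comes from.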
Upvotes: 5
Reputation: 887088
We can use map2_dbl from purrr:
library(purrr)
library(dplyr)
df %>%
mutate(salary = map2_dbl(lowerboundary, upperboundary, ~ runif(1, .x, .y)))
Output:
# income lowerboundary upperboundary salary
#1 NA NA NA NaN
#2 0 0 50 33.771312
#3 0 0 50 3.577857
#4 NA NA NA NaN
#5 4 425 600 514.912989
#6 NA NA NA NaN
#7 NA NA NA NaN
#8 4 425 600 516.179313
#9 NA NA NA NaN
#10 12 2400 3000 2815.442543
Upvotes: 1
Reputation: 39595
Try this approach with apply(). You can use runif() to generate the value from the lowerboundary and upperboundary variables at row level. For rows with NA you will get NaN. Here is the code:
#Code
df$Salary <- apply(df[, -1], 1, function(x) runif(1, x[1], x[2]))
Output:
income lowerboundary upperboundary Salary
1 NA NA NA NaN
2 0 0 50 26.86049
3 0 0 50 36.44212
4 NA NA NA NaN
5 4 425 600 459.25802
6 NA NA NA NaN
7 NA NA NA NaN
8 4 425 600 535.39891
9 NA NA NA NaN
10 12 2400 3000 2754.34136
Some data used:
#Data
df <- structure(list(income = c(NA, 0L, 0L, NA, 4L, NA, NA, 4L, NA,
12L), lowerboundary = c(NA, 0L, 0L, NA, 425L, NA, NA, 425L, NA,
2400L), upperboundary = c(NA, 50L, 50L, NA, 600L, NA, NA, 600L,
NA, 3000L)), row.names = c(NA, -10L), class = "data.frame")
Upvotes: 2