vahis100

Reputation: 101

How to simulate a value in R from a lower and upper boundary assuming that these are all uniformly distributed?

I have the following tibble:

# A tibble: 1,100 x 3
   income       minimum       maximum
    <dbl>         <dbl>         <dbl>
 1     NA            NA            NA
 2      0             0            25
 3      0             0            25
 4     NA            NA            NA
 5      4           100           200

I want to simulate a value between the minimum and the maximum of each row, under the assumption that the value is uniformly distributed between the two.

Any idea how to do this? The simulated values should appear on the right side under the variable income.

Upvotes: 3

Views: 193

Answers (3)

Edo

Reputation: 7818

This is probably what you're looking for:

df$salary <- runif(nrow(df)) * (df$upperboundary - df$lowerboundary) + df$lowerboundary

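By default, runif draws from the interval 0 to 1. Multiplying the draw by the range (upper minus lower) and adding the lower boundary rescales it to your boundaries. It is the fastest of the solutions benchmarked below.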

With dplyr if your code is tidyverse oriented:

df %>% mutate(salary = runif(n()) * (upperboundary - lowerboundary) + lowerboundary)

However, it is also possible to define the boundaries directly:

df$salary <- runif(nrow(df), df$lowerboundary, df$upperboundary)

If you had no NAs, this would be the optimal and fastest solution. Either way, it is the most readable. [Thanks to @user20650 for your help!]
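
If the NAs are the problem, one possible workaround is to draw values only for the rows where both boundaries are present. This is a minimal sketch, not part of the original answer: the small df and the helper vector ok are made up for illustration, and the column names follow the ones used above.

```r
set.seed(42)  # only to make the sketch reproducible

# Toy data with the same column names as in the answer (assumed example values)
df <- data.frame(
  income        = c(NA, 0, 0, NA, 4),
  lowerboundary = c(NA, 0, 0, NA, 425),
  upperboundary = c(NA, 50, 50, NA, 600)
)

# ok is a hypothetical helper: TRUE where both boundaries are known
ok <- !is.na(df$lowerboundary) & !is.na(df$upperboundary)

# Start with NA everywhere, then fill only the complete rows
df$salary <- NA_real_
df$salary[ok] <- runif(sum(ok), df$lowerboundary[ok], df$upperboundary[ok])
```

Rows with missing boundaries stay NA instead of becoming NaN, and runif is never called with NA arguments, so it should not emit the "NAs produced" warning.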


Additional details.

How does this work?

runif(nrow(df)) * (df$upperboundary - df$lowerboundary) + df$lowerboundary

Let's look at a single draw and define a min and a max manually.

By default runif(1) is equal to:

runif(1, min = 0, max = 1)

Therefore, it returns a random number between 0 and 1 according to a uniform distribution.

To return a random number between two different limits, say for instance min = 10 and max = 20, you can do it this way:

runif(1, min = 10, max = 20)

or

min <- 10
max <- 20
runif(1, min = 0, max = 1) * (max - min) + min

If the output of runif is 0:

0 * (20 - 10) + 10
==> 10

If the output of runif is 1:

1 * (20 - 10) + 10
==> 20 - 10 + 10
==> 20
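
A quick way to convince yourself that the two forms agree is to reset the seed before each call and compare the results. This is only a sketch: the names lo, hi, direct and rescaled are made up for illustration (they also avoid masking the base functions min and max), and the comparison uses all.equal to allow for floating-point rounding.

```r
lo <- 10
hi <- 20

# Same seed before each call, so both use the same underlying random draws
set.seed(123)
direct <- runif(5, min = lo, max = hi)

set.seed(123)
rescaled <- runif(5, min = 0, max = 1) * (hi - lo) + lo

# Should be TRUE if the two forms produce the same values
isTRUE(all.equal(direct, rescaled))
```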

Here is also a dplyr alternative to the apply solution:

library(dplyr)
df %>% 
  rowwise() %>% 
  mutate(salary = runif(1, lowerboundary, upperboundary)) %>% 
  ungroup()

Here's a speed comparison. The "maths" one is the fastest:

microbenchmark::microbenchmark(
  apply  =  apply(df[-1],1, function(x) runif(1, x[1], x[2])),
  maths  =  runif(nrow(df)) * (df$upperboundary - df$lowerboundary) + df$lowerboundary,
  maths2 =  runif(nrow(df), df$lowerboundary, df$upperboundary),
  dplyr  =  df %>% rowwise() %>% mutate(runif = runif(1, lowerboundary, upperboundary)) %>% ungroup()
)
#> Unit: microseconds
#>    expr    min      lq     mean  median      uq    max neval
#>   apply  907.1  955.90 1175.188 1023.70 1280.90 4455.0   100
#>   maths   16.8   26.05   32.651   31.25   38.65   75.0   100
#>  maths2  117.8  128.00  156.533  136.60  175.15  336.7   100
#>   dplyr 1424.2 1496.60 1821.068 1661.15 1989.20 3952.7   100

Upvotes: 5

akrun

Reputation: 887088

We can use map2_dbl from purrr:

library(purrr)
library(dplyr)
df %>%
   mutate(salary = map2_dbl(lowerboundary, upperboundary, ~ runif(1, .x, .y)))

-output

#   income lowerboundary upperboundary      salary
#1      NA            NA            NA         NaN
#2       0             0            50   33.771312
#3       0             0            50    3.577857
#4      NA            NA            NA         NaN
#5       4           425           600  514.912989
#6      NA            NA            NA         NaN
#7      NA            NA            NA         NaN
#8       4           425           600  516.179313
#9      NA            NA            NA         NaN
#10     12          2400          3000 2815.442543

Upvotes: 1

Duck

Reputation: 39595

Try this approach with apply(). runif() generates each value row by row from the lowerboundary and upperboundary variables. Rows where the boundaries are NA return NaN. Here is the code:

#Code
df$Salary <- apply(df[, -1], 1, function(x) runif(1, x[1], x[2]))

Output:

   income lowerboundary upperboundary     Salary
1      NA            NA            NA        NaN
2       0             0            50   26.86049
3       0             0            50   36.44212
4      NA            NA            NA        NaN
5       4           425           600  459.25802
6      NA            NA            NA        NaN
7      NA            NA            NA        NaN
8       4           425           600  535.39891
9      NA            NA            NA        NaN
10     12          2400          3000 2754.34136
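
If you would rather have NA than NaN in those rows, you can replace them afterwards. A minimal sketch, assuming a small made-up df with the same column names; suppressWarnings() is used here only to silence the "NAs produced" warning that runif() gives for the NA rows.

```r
# Toy data (assumed example values), same column layout as above
df <- data.frame(
  income        = c(NA, 0, 4),
  lowerboundary = c(NA, 0, 425),
  upperboundary = c(NA, 50, 600)
)

# Row-wise draw; rows with NA boundaries come back as NaN
df$Salary <- suppressWarnings(
  apply(df[, -1], 1, function(x) runif(1, x[1], x[2]))
)

# Turn the NaN results into regular missing values
df$Salary[is.nan(df$Salary)] <- NA
```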

Some data used:

#Data
df <- structure(list(income = c(NA, 0L, 0L, NA, 4L, NA, NA, 4L, NA, 
12L), lowerboundary = c(NA, 0L, 0L, NA, 425L, NA, NA, 425L, NA, 
2400L), upperboundary = c(NA, 50L, 50L, NA, 600L, NA, NA, 600L, 
NA, 3000L)), row.names = c(NA, -10L), class = "data.frame")

Upvotes: 2
