stats_noob
stats_noob

Reputation: 5925

Simulating Random Draws From a "Hat"

Suppose I have the following 10 variables (num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5):

set.seed(123)

num_var_1 <- rnorm(1000, 10, 1)
num_var_2 <- rnorm(1000, 10, 5)
num_var_3 <- rnorm(1000, 10, 10)
num_var_4 <- rnorm(1000, 10, 10)
num_var_5 <- rnorm(1000, 10, 10)

factor_1 <- c("A","B", "C")
factor_2 <- c("AA","BB", "CC")
factor_3 <- c("AAA","BBB", "CCC", "DDD")
factor_4 <- c("AAAA","BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA","BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")

factor_var_1 <- as.factor(sample(factor_1, 1000, replace=TRUE, prob=c(0.3, 0.5, 0.2)))
factor_var_2 <-  as.factor(sample(factor_2, 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
factor_var_3 <-  as.factor(sample(factor_3, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <-  as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <-  as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))

id = 1:1000

my_data = data.frame(id,num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5)


> head(my_data)
  id num_var_1 num_var_2 num_var_3 num_var_4  num_var_5 factor_var_1 factor_var_2 factor_var_3 factor_var_4 factor_var_5
1  1  9.439524  5.021006  4.883963  8.496925  11.965498            B           AA          AAA         CCCC         AAAA
2  2  9.769823  4.800225 12.369379  6.722429  16.501132            B           AA          AAA         AAAA         AAAA
3  3 11.558708  9.910099  4.584108 -4.481653  16.710042            C           AA          BBB         AAAA         CCCC
4  4 10.070508  9.339124 22.192276  3.027154  -2.841578            B           CC          DDD         BBBB         AAAA
5  5 10.129288 -2.746714 11.741359 35.984902 -10.261096            B           AA          AAA         DDDD         DDDD
6  6 11.715065 15.202867  3.847317  9.625850  32.053261            B           AA          CCC         BBBB         EEEE

My Question: I am interested in selecting a random number of variables from this data - and taking random subsets from these variables. (And then repeating this process many times). For example - I would like to record such a randomly generated list:

etc.

I can perform the above manually, but this would take a long time (e.g. 10 iterations). Is there a way to automate this process and in the end, just output this kind of list (10 rows × 2 columns) :

Iteration                                                                                                  Condition
1                                               num_var_2 > 12, factor_var_1 = A, C, factor_var_4 = BBBB, DDDD, EEEE
2            num_var_1 >0, num_var_3 <10, factor_var_2 = AA, BB, CC, factor_var_3 = AAA, factor_var_5 = CCCCC, DDDDD
3                                                  num_var_2 <5, num_var_5 <10, factor_var_1 = B, factor_var_3 = AAA
4                                                                                                factor_var_4 = BBBB

Can someone please show me how to do this?

Upvotes: 2

Views: 269

Answers (2)

ekoam
ekoam

Reputation: 8844

As far as I understand, for a factor or a character vector, we need a function to randomly decide a sample size and then a random sample out of some data points. For a numeric vector, we need a function to randomly decide a cutoff between the minimum and maximum and whether to pick numbers greater or smaller than that cutoff point. Finally, we need to summarize the rules based on the formats provided in this post.

Consider the following function for factors and characters. It first decides a random sample size based on the number of items in x and then randomly samples items from x.

random_pick <- function(x) {
  sample_size <- sample.int(length(x), 1L)
  out <- x[sort(sample.int(length(x), sample_size))]
  list("=", out)
}

Also, consider a function like this for numerics. It finds the min/max, determines the cutoff, and the sign for comparison.

random_trunc <- function(x) {
  rng <- range(x)
  cutoff <- runif(1L, rng[[1L]], rng[[2L]])
  sgn <- c("<", ">")[[sample.int(2L, 1L)]]
  list(sgn, cutoff)
}

Then, we assemble these two functions together for your specific case. Note that for characters we only need to pick the unique ones.

random_select <- function(x) {
  if (is.numeric(x))
    return(random_trunc(x))
  if (is.factor(x))
    return(random_pick(levels(x)))
  random_pick(unique(x))
}

report generates the rules we want based on the formats provided.

report <- function(f) function(...) {
  x <- f(...)
  if (x[[1L]] != "=")
    return(sprintf("%s %.2f", x[[1L]], x[[2L]]))
  sprintf("%s \"%s\"", x[[1L]], paste0(x[[2L]], collapse = ", "))
}

Now we are ready to write our function for randomly generating rules from a dataset. The idea is to first randomly select from all variables (except for the first one id), then apply random_rule to each selected, and finally summarize the results.

random_rule <- function(dt) {
  out <- vapply(
    dt[random_pick(names(dt)[-1L])[[2L]]], 
    report(random_select), character(1L)
  )
  paste(names(out), out, collapse = ", ")
}

As a result, we can simply do this for as many iterations as needed

set.seed(123)
data.frame(iteration = 1:10, results = replicate(10L, random_rule(my_data)))

Results

> set.seed(123)
> data.frame(iteration = 1:10, records = replicate(10L, random_rule(my_data)))
   iteration
1          1
2          2
3          3
4          4
5          5
6          6
7          7
8          8
9          9
10        10
                                                                                                                                                                                                                                                  records
1                                                                                                                                                                                             num_var_2 < 12.51, num_var_3 > 41.50, factor_var_1 = "A, B"
2                                                                                                                                         num_var_1 < 11.16, num_var_3 > 15.63, num_var_4 > -3.87, factor_var_2 = "BB", factor_var_4 = "AAAA, BBBB, DDDD"
3                                                                                                          num_var_1 < 9.87, num_var_2 < -1.32, num_var_3 > -5.54, num_var_4 > 24.09, num_var_5 < 3.28, factor_var_2 = "AA, BB, CC", factor_var_3 = "CCC"
4                                                        num_var_1 > 9.72, num_var_2 > -1.93, num_var_3 < 43.27, num_var_4 < 32.11, num_var_5 > -12.77, factor_var_1 = "B", factor_var_2 = "AA", factor_var_4 = "AAAA, BBBB, DDDD", factor_var_5 = "AAAA"
5                                                                                           num_var_1 > 10.51, num_var_2 > 13.61, num_var_3 > 22.14, num_var_4 < -2.75, factor_var_1 = "A, B, C", factor_var_3 = "AAA", factor_var_4 = "BBBB, DDDD, EEEE"
6                                                                                                                                                                                             factor_var_1 = "A, B, C", factor_var_5 = "BBBB, CCCC, EEEE"
7                                                                         num_var_1 > 9.34, num_var_2 < 18.59, num_var_3 < 7.39, num_var_4 > 16.66, num_var_5 > 35.48, factor_var_1 = "C", factor_var_2 = "AA, BB, CC", factor_var_4 = "AAAA, BBBB, CCCC"
8  num_var_1 > 10.66, num_var_2 > 25.74, num_var_3 > 13.81, num_var_4 > 31.73, num_var_5 > -2.40, factor_var_1 = "A, B, C", factor_var_2 = "AA, BB", factor_var_3 = "AAA, CCC, DDD", factor_var_4 = "AAAA, BBBB, CCCC, DDDD, EEEE", factor_var_5 = "DDDD"
9   num_var_1 < 10.78, num_var_2 < 11.86, num_var_3 < -7.95, num_var_4 < 7.12, num_var_5 > 39.57, factor_var_1 = "A, B, C", factor_var_2 = "AA, BB, CC", factor_var_3 = "CCC", factor_var_4 = "BBBB, EEEE", factor_var_5 = "AAAA, BBBB, CCCC, DDDD, EEEE"
10                                                                                                num_var_1 < 7.63, num_var_2 > 19.04, num_var_4 > 37.87, num_var_5 < -14.85, factor_var_1 = "A, B", factor_var_2 = "AA, CC", factor_var_4 = "AAAA, CCCC"

Put everything together

random_pick <- function(x) {
  sample_size <- sample.int(length(x), 1L)
  out <- x[sort(sample.int(length(x), sample_size))]
  list("=", out)
}

random_trunc <- function(x) {
  rng <- range(x)
  cutoff <- runif(1L, rng[[1L]], rng[[2L]])
  sgn <- c("<", ">")[[sample.int(2L, 1L)]]
  list(sgn, cutoff)
}

random_select <- function(x) {
  if (is.numeric(x))
    return(random_trunc(x))
  if (is.factor(x))
    return(random_pick(levels(x)))
  random_pick(unique(x))
}

report <- function(f) function(...) {
  x <- f(...)
  if (x[[1L]] != "=")
    return(sprintf("%s %.2f", x[[1L]], x[[2L]]))
  sprintf("%s \"%s\"", x[[1L]], paste0(x[[2L]], collapse = ", "))
}

random_rule <- function(dt) {
  out <- vapply(
    dt[random_pick(names(dt)[-1L])[[2L]]], 
    report(random_select), character(1L)
  )
  paste(names(out), out, collapse = ", ")
}

set.seed(123)
data.frame(iteration = 1:10, records = replicate(10L, random_rule(my_data)))

Upvotes: 2

jay.sf
jay.sf

Reputation: 73572

You may define a function FUN(n) that creates a data set as shown in OP.

FUN <- function(n=1e3) {
  num_var_1 <- rnorm(n, 10, 1)
  num_var_2 <- rnorm(n, 10, 5)
  num_var_3 <- rnorm(n, 10, 10)
  num_var_4 <- rnorm(n, 10, 10)
  num_var_5 <- rnorm(n, 10, 10)
  factor_1 <- c("A", "B", "C")
  factor_2 <- c("AA", "BB", "CC")
  factor_3 <- c("AAA", "BBB", "CCC", "DDD")
  factor_4 <- c("AAAA", "BBBB", "CCCC", "DDDD", "EEEE")
  factor_5 <- c("AAAAA", "BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")
  factor_var_1 <- as.factor(sample(factor_1, n, replace=TRUE, 
                                   prob=c(0.3, 0.5, 0.2)))
  factor_var_2 <- as.factor(sample(factor_2, n, replace=TRUE, 
                                   prob=c(0.5, 0.3, 0.2)))
  factor_var_3 <- as.factor(sample(factor_3, n, replace=TRUE, 
                                   prob=c(0.5, 0.2, 0.2, 0.1)))
  factor_var_4 <- as.factor(sample(factor_4, n, replace=TRUE, 
                                   prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
  factor_var_5 <- as.factor(sample(factor_5, n, replace=TRUE, 
                                   prob=c(0.3, 0.2, 0.1, 0.1, 0.1, .2)))
  id <- 1:n
  return(data.frame(id, num_var_1, num_var_2, num_var_3, num_var_4, 
                    num_var_5, factor_var_1, factor_var_2, factor_var_3,
                    factor_var_4, factor_var_5))
}

Next, define (appropriate) expressions as strings in a list evl.

evl <- list(
  c('num_var_2 > 12', 'factor_var_1 %in% c("A", "C")', 
    'factor_var_4 %in% c("BBBB", "DDDD", "EEEE")'),
  c('num_var_1 > 0', 'num_var_3 < 10', 'factor_var_2 %in% c("AA", "BB", "CC")',
    'factor_var_3 %in% "AAA"', 'factor_var_5 %in% c("CCCCC", "DDDDD")'),
  c('num_var_2 < 5', 'num_var_5 < 10', 'factor_var_1 %in% "B"',
    'factor_var_3 %in% "AAA"'),
  c('factor_var_4 %in% "BBBB"')
)

Finally, in Map define a function that subsets the data of one replicateion according to the respective expressions using eval(parse(text=)). Use set.seed() outside the function to prevent the same data from being generated on each iteration.

set.seed(42)
result <- Map(\(x, y) x[with(x, eval(parse(text=paste(y, collapse=' & ')))), ],
              replicate(length(evl), FUN(), simplify=FALSE),
              evl)

Note: R version 4.1.2 (2021-11-01)

Gives

str(result)
# List of 4
# $ :'data.frame':  59 obs. of  11 variables:
#   ..$ id          : int [1:59] 3 6 25 29 32 34 52 54 58 93 ...
# ..$ num_var_1   : num [1:59] 9.99 10.95 9.38 8.53 9.65 ...
# ..$ num_var_2   : num [1:59] 13.6 17.4 20.3 19.3 16.1 ...
# ..$ num_var_3   : num [1:59] 9.42 18.67 6.1 25.71 -2.73 ...
# ..$ num_var_4   : num [1:59] 6.29 9.22 3.68 16.27 15.77 ...
# ..$ num_var_5   : num [1:59] 13.37 18.86 4.89 24.18 26.11 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 3 1 3 1 3 3 1 3 1 1 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 3 3 1 1 1 2 3 3 1 3 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 2 1 1 4 2 1 3 2 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 5 2 2 2 2 2 5 2 4 4 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 2 3 5 4 4 6 1 6 ...
# $ :'data.frame':  53 obs. of  11 variables:
#   ..$ id          : int [1:53] 2 14 28 36 49 59 75 103 134 137 ...
# ..$ num_var_1   : num [1:53] 9.67 11.61 11.22 10.14 10.5 ...
# ..$ num_var_2   : num [1:53] 10.89 7.12 2.38 13.28 10.88 ...
# ..$ num_var_3   : num [1:53] 5.87 7.33 2.88 -10.78 4.09 ...
# ..$ num_var_4   : num [1:53] 19.239 6.261 -0.158 14.586 -0.544 ...
# ..$ num_var_5   : num [1:53] -5.1 21.04 2.81 1.76 27.19 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 1 1 1 2 3 2 3 3 2 3 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 2 2 2 3 3 3 3 2 1 1 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 1 5 5 1 4 4 4 4 1 4 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 4 4 3 3 4 4 4 4 3 ...
# $ :'data.frame':  20 obs. of  11 variables:
#   ..$ id          : int [1:20] 3 44 91 181 222 233 241 287 293 302 ...
# ..$ num_var_1   : num [1:20] 12 10.26 9.65 8.48 12.1 ...
# ..$ num_var_2   : num [1:20] 3.68 3.61 3.28 4.01 1.78 ...
# ..$ num_var_3   : num [1:20] 4.113 -3.481 17.654 0.496 5.457 ...
# ..$ num_var_4   : num [1:20] 9.25 19.79 17.15 -4.72 22.16 ...
# ..$ num_var_5   : num [1:20] 6 8.49 4.31 4.67 1.96 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 2 1 3 1 1 1 1 3 2 1 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 3 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 5 1 1 1 2 6 1 2 ...
# $ :'data.frame':  205 obs. of  11 variables:
#   ..$ id          : int [1:205] 7 10 23 24 27 29 31 33 38 40 ...
# ..$ num_var_1   : num [1:205] 10.23 9.78 8.92 10.16 9.93 ...
# ..$ num_var_2   : num [1:205] 23.49 13.06 12.17 16.88 7.93 ...
# ..$ num_var_3   : num [1:205] 6.33 9.33 14.04 21.66 28.56 ...
# ..$ num_var_4   : num [1:205] 16.33 -1.805 0.509 21.2 15.158 ...
# ..$ num_var_5   : num [1:205] 8.48 -1.31 5.03 15.07 19.48 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 1 1 2 1 2 1 2 2 3 2 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 3 1 1 2 1 1 1 2 1 3 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 2 3 1 3 4 3 1 3 2 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 2 2 2 2 2 2 2 2 2 2 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 2 6 6 2 6 1 2 2 ...

Upvotes: 1

Related Questions