Reputation: 5925
Suppose I have the following 10 variables (num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5):
set.seed(123)
num_var_1 <- rnorm(1000, 10, 1)
num_var_2 <- rnorm(1000, 10, 5)
num_var_3 <- rnorm(1000, 10, 10)
num_var_4 <- rnorm(1000, 10, 10)
num_var_5 <- rnorm(1000, 10, 10)
factor_1 <- c("A","B", "C")
factor_2 <- c("AA","BB", "CC")
factor_3 <- c("AAA","BBB", "CCC", "DDD")
factor_4 <- c("AAAA","BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA","BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")
factor_var_1 <- as.factor(sample(factor_1, 1000, replace=TRUE, prob=c(0.3, 0.5, 0.2)))
factor_var_2 <- as.factor(sample(factor_2, 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
factor_var_3 <- as.factor(sample(factor_3, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <- as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))
id = 1:1000
my_data = data.frame(id,num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5)
> head(my_data)
id num_var_1 num_var_2 num_var_3 num_var_4 num_var_5 factor_var_1 factor_var_2 factor_var_3 factor_var_4 factor_var_5
1 1 9.439524 5.021006 4.883963 8.496925 11.965498 B AA AAA CCCC AAAA
2 2 9.769823 4.800225 12.369379 6.722429 16.501132 B AA AAA AAAA AAAA
3 3 11.558708 9.910099 4.584108 -4.481653 16.710042 C AA BBB AAAA CCCC
4 4 10.070508 9.339124 22.192276 3.027154 -2.841578 B CC DDD BBBB AAAA
5 5 10.129288 -2.746714 11.741359 35.984902 -10.261096 B AA AAA DDDD DDDD
6 6 11.715065 15.202867 3.847317 9.625850 32.053261 B AA CCC BBBB EEEE
My Question: I am interested in selecting a random number of variables from this data - and taking random subsets from these variables. (And then repeating this process many times). For example - I would like to record such a randomly generated list:
Iteration 1: num_var_2 > 12, factor_var_1 = "A, C", factor_var_4 = "BBBB, DDDD, EEEE"
Iteration 2: num_var_1 >0, num_var_3 <10, factor_var_2 = "AA, BB, CC", factor_var_3 = "AAA", factor_var_5 = "CCCCC, DDDDD"
Iteration 3: num_var_2 <5, num_var_5 <10, factor_var_1 = "B", factor_var_3 = "AAA"
Iteration 4 : factor_var_4 = "BBBB"
etc.
I can perform the above manually, but this would take a long time (e.g. 10 iterations). Is there a way to automate this process and in the end, just output this kind of list (10 rows × 2 columns) :
Iteration Condition
1 num_var_2 > 12, factor_var_1 = A, C, factor_var_4 = BBBB, DDDD, EEEE
2 num_var_1 >0, num_var_3 <10, factor_var_2 = AA, BB, CC, factor_var_3 = AAA, factor_var_5 = CCCCC, DDDDD
3 num_var_2 <5, num_var_5 <10, factor_var_1 = B, factor_var_3 = AAA
4 factor_var_4 = BBBB
Can someone please show me how to do this?
Upvotes: 2
Views: 269
Reputation: 8844
As far as I understand, for a factor
or a character
vector, we need a function to randomly decide a sample size and then a random sample out of some data points. For a numeric
vector, we need a function to randomly decide a cutoff between the minimum and maximum and whether to pick numbers greater or smaller than that cutoff point. Finally, we need to summarize the rules based on the formats provided in this post.
Consider the following function for factor
s and character
s. It first decides a random sample size based on the number of items in x
and then randomly samples items from x
.
random_pick <- function(x) {
sample_size <- sample.int(length(x), 1L)
out <- x[sort(sample.int(length(x), sample_size))]
list("=", out)
}
Also, consider a function like this for numeric
s. It finds the min/max, determines the cutoff, and the sign for comparison.
random_trunc <- function(x) {
rng <- range(x)
cutoff <- runif(1L, rng[[1L]], rng[[2L]])
sgn <- c("<", ">")[[sample.int(2L, 1L)]]
list(sgn, cutoff)
}
Then, we assemble these two functions together for your specific case. Note that for character
s we only need to pick the unique ones.
random_select <- function(x) {
if (is.numeric(x))
return(random_trunc(x))
if (is.factor(x))
return(random_pick(levels(x)))
random_pick(unique(x))
}
report
generates the rules we want based on the formats provided.
report <- function(f) function(...) {
x <- f(...)
if (x[[1L]] != "=")
return(sprintf("%s %.2f", x[[1L]], x[[2L]]))
sprintf("%s \"%s\"", x[[1L]], paste0(x[[2L]], collapse = ", "))
}
Now we are ready to write our function for randomly generating rules from a dataset. The idea is to first randomly select from all variables (except for the first one id
), then apply random_rule
to each selected, and finally summarize the results.
random_rule <- function(dt) {
out <- vapply(
dt[random_pick(names(dt)[-1L])[[2L]]],
report(random_select), character(1L)
)
paste(names(out), out, collapse = ", ")
}
As a result, we can simply do this for as many iterations as needed
set.seed(123)
data.frame(iteration = 1:10, results = replicate(10L, random_rule(my_data)))
Results
> set.seed(123)
> data.frame(iteration = 1:10, records = replicate(10L, random_rule(my_data)))
iteration
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
records
1 num_var_2 < 12.51, num_var_3 > 41.50, factor_var_1 = "A, B"
2 num_var_1 < 11.16, num_var_3 > 15.63, num_var_4 > -3.87, factor_var_2 = "BB", factor_var_4 = "AAAA, BBBB, DDDD"
3 num_var_1 < 9.87, num_var_2 < -1.32, num_var_3 > -5.54, num_var_4 > 24.09, num_var_5 < 3.28, factor_var_2 = "AA, BB, CC", factor_var_3 = "CCC"
4 num_var_1 > 9.72, num_var_2 > -1.93, num_var_3 < 43.27, num_var_4 < 32.11, num_var_5 > -12.77, factor_var_1 = "B", factor_var_2 = "AA", factor_var_4 = "AAAA, BBBB, DDDD", factor_var_5 = "AAAA"
5 num_var_1 > 10.51, num_var_2 > 13.61, num_var_3 > 22.14, num_var_4 < -2.75, factor_var_1 = "A, B, C", factor_var_3 = "AAA", factor_var_4 = "BBBB, DDDD, EEEE"
6 factor_var_1 = "A, B, C", factor_var_5 = "BBBB, CCCC, EEEE"
7 num_var_1 > 9.34, num_var_2 < 18.59, num_var_3 < 7.39, num_var_4 > 16.66, num_var_5 > 35.48, factor_var_1 = "C", factor_var_2 = "AA, BB, CC", factor_var_4 = "AAAA, BBBB, CCCC"
8 num_var_1 > 10.66, num_var_2 > 25.74, num_var_3 > 13.81, num_var_4 > 31.73, num_var_5 > -2.40, factor_var_1 = "A, B, C", factor_var_2 = "AA, BB", factor_var_3 = "AAA, CCC, DDD", factor_var_4 = "AAAA, BBBB, CCCC, DDDD, EEEE", factor_var_5 = "DDDD"
9 num_var_1 < 10.78, num_var_2 < 11.86, num_var_3 < -7.95, num_var_4 < 7.12, num_var_5 > 39.57, factor_var_1 = "A, B, C", factor_var_2 = "AA, BB, CC", factor_var_3 = "CCC", factor_var_4 = "BBBB, EEEE", factor_var_5 = "AAAA, BBBB, CCCC, DDDD, EEEE"
10 num_var_1 < 7.63, num_var_2 > 19.04, num_var_4 > 37.87, num_var_5 < -14.85, factor_var_1 = "A, B", factor_var_2 = "AA, CC", factor_var_4 = "AAAA, CCCC"
Put everything together
random_pick <- function(x) {
sample_size <- sample.int(length(x), 1L)
out <- x[sort(sample.int(length(x), sample_size))]
list("=", out)
}
random_trunc <- function(x) {
rng <- range(x)
cutoff <- runif(1L, rng[[1L]], rng[[2L]])
sgn <- c("<", ">")[[sample.int(2L, 1L)]]
list(sgn, cutoff)
}
random_select <- function(x) {
if (is.numeric(x))
return(random_trunc(x))
if (is.factor(x))
return(random_pick(levels(x)))
random_pick(unique(x))
}
report <- function(f) function(...) {
x <- f(...)
if (x[[1L]] != "=")
return(sprintf("%s %.2f", x[[1L]], x[[2L]]))
sprintf("%s \"%s\"", x[[1L]], paste0(x[[2L]], collapse = ", "))
}
random_rule <- function(dt) {
out <- vapply(
dt[random_pick(names(dt)[-1L])[[2L]]],
report(random_select), character(1L)
)
paste(names(out), out, collapse = ", ")
}
set.seed(123)
data.frame(iteration = 1:10, records = replicate(10L, random_rule(my_data)))
Upvotes: 2
Reputation: 73572
You may define a function FUN(n)
that creates a data set as shown in OP.
FUN <- function(n=1e3) {
num_var_1 <- rnorm(n, 10, 1)
num_var_2 <- rnorm(n, 10, 5)
num_var_3 <- rnorm(n, 10, 10)
num_var_4 <- rnorm(n, 10, 10)
num_var_5 <- rnorm(n, 10, 10)
factor_1 <- c("A", "B", "C")
factor_2 <- c("AA", "BB", "CC")
factor_3 <- c("AAA", "BBB", "CCC", "DDD")
factor_4 <- c("AAAA", "BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA", "BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")
factor_var_1 <- as.factor(sample(factor_1, n, replace=TRUE,
prob=c(0.3, 0.5, 0.2)))
factor_var_2 <- as.factor(sample(factor_2, n, replace=TRUE,
prob=c(0.5, 0.3, 0.2)))
factor_var_3 <- as.factor(sample(factor_3, n, replace=TRUE,
prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <- as.factor(sample(factor_4, n, replace=TRUE,
prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <- as.factor(sample(factor_5, n, replace=TRUE,
prob=c(0.3, 0.2, 0.1, 0.1, 0.1, .2)))
id <- 1:n
return(data.frame(id, num_var_1, num_var_2, num_var_3, num_var_4,
num_var_5, factor_var_1, factor_var_2, factor_var_3,
factor_var_4, factor_var_5))
}
Next, define (appropriate) expressions as strings in a list evl
.
evl <- list(
c('num_var_2 > 12', 'factor_var_1 %in% c("A", "C")',
'factor_var_4 %in% c("BBBB", "DDDD", "EEEE")'),
c('num_var_1 > 0', 'num_var_3 < 10', 'factor_var_2 %in% c("AA", "BB", "CC")',
'factor_var_3 %in% "AAA"', 'factor_var_5 %in% c("CCCCC", "DDDDD")'),
c('num_var_2 < 5', 'num_var_5 < 10', 'factor_var_1 %in% "B"',
'factor_var_3 %in% "AAA"'),
c('factor_var_4 %in% "BBBB"')
)
Finally, in Map
define a function that subsets the data of one replicate
ion according to the respective expressions using eval(parse(text=))
. Use set.seed()
outside the function to prevent the same data from being generated on each iteration.
set.seed(42)
result <- Map(\(x, y) x[with(x, eval(parse(text=paste(y, collapse=' & ')))), ],
replicate(length(evl), FUN(), simplify=FALSE),
evl)
Note: R version 4.1.2 (2021-11-01)
str(result)
# List of 4
# $ :'data.frame': 59 obs. of 11 variables:
# ..$ id : int [1:59] 3 6 25 29 32 34 52 54 58 93 ...
# ..$ num_var_1 : num [1:59] 9.99 10.95 9.38 8.53 9.65 ...
# ..$ num_var_2 : num [1:59] 13.6 17.4 20.3 19.3 16.1 ...
# ..$ num_var_3 : num [1:59] 9.42 18.67 6.1 25.71 -2.73 ...
# ..$ num_var_4 : num [1:59] 6.29 9.22 3.68 16.27 15.77 ...
# ..$ num_var_5 : num [1:59] 13.37 18.86 4.89 24.18 26.11 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 3 1 3 1 3 3 1 3 1 1 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 3 3 1 1 1 2 3 3 1 3 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 2 1 1 4 2 1 3 2 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 5 2 2 2 2 2 5 2 4 4 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 2 3 5 4 4 6 1 6 ...
# $ :'data.frame': 53 obs. of 11 variables:
# ..$ id : int [1:53] 2 14 28 36 49 59 75 103 134 137 ...
# ..$ num_var_1 : num [1:53] 9.67 11.61 11.22 10.14 10.5 ...
# ..$ num_var_2 : num [1:53] 10.89 7.12 2.38 13.28 10.88 ...
# ..$ num_var_3 : num [1:53] 5.87 7.33 2.88 -10.78 4.09 ...
# ..$ num_var_4 : num [1:53] 19.239 6.261 -0.158 14.586 -0.544 ...
# ..$ num_var_5 : num [1:53] -5.1 21.04 2.81 1.76 27.19 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 1 1 1 2 3 2 3 3 2 3 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 2 2 2 3 3 3 3 2 1 1 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 1 5 5 1 4 4 4 4 1 4 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 4 4 3 3 4 4 4 4 3 ...
# $ :'data.frame': 20 obs. of 11 variables:
# ..$ id : int [1:20] 3 44 91 181 222 233 241 287 293 302 ...
# ..$ num_var_1 : num [1:20] 12 10.26 9.65 8.48 12.1 ...
# ..$ num_var_2 : num [1:20] 3.68 3.61 3.28 4.01 1.78 ...
# ..$ num_var_3 : num [1:20] 4.113 -3.481 17.654 0.496 5.457 ...
# ..$ num_var_4 : num [1:20] 9.25 19.79 17.15 -4.72 22.16 ...
# ..$ num_var_5 : num [1:20] 6 8.49 4.31 4.67 1.96 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 2 1 3 1 1 1 1 3 2 1 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 3 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 5 1 1 1 2 6 1 2 ...
# $ :'data.frame': 205 obs. of 11 variables:
# ..$ id : int [1:205] 7 10 23 24 27 29 31 33 38 40 ...
# ..$ num_var_1 : num [1:205] 10.23 9.78 8.92 10.16 9.93 ...
# ..$ num_var_2 : num [1:205] 23.49 13.06 12.17 16.88 7.93 ...
# ..$ num_var_3 : num [1:205] 6.33 9.33 14.04 21.66 28.56 ...
# ..$ num_var_4 : num [1:205] 16.33 -1.805 0.509 21.2 15.158 ...
# ..$ num_var_5 : num [1:205] 8.48 -1.31 5.03 15.07 19.48 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 1 1 2 1 2 1 2 2 3 2 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 3 1 1 2 1 1 1 2 1 3 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 2 3 1 3 4 3 1 3 2 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 2 2 2 2 2 2 2 2 2 2 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 2 6 6 2 6 1 2 2 ...
Upvotes: 1