Reputation: 575
Been working on this for two days, and don't see any progress. Say I have 20 numbers, and I want to (without replacement)
There are (20 choose 10) * (10 choose 3) * (7 choose 3) * (4 choose 2) * (2 choose 2) = 4,655,851,200 total groups.
I'd like either a nested list or a data.frame that I can evaluate these groups over. Is that possible to do quickly in R?
Below I have a pretty unwieldy solution for the case when I'm just trying to get one unique group of 5, one unique group of 3, and one unique group of 2, from 10 numbers.
This code actually works, but quickly becomes untenable for larger groups. I'm wondering if there's a fast way to create this grid? It seems like it should be easily possible, but I can't figure out how. Thanks in advance!!
library(data.table)
library(gtools)
n <- 10
group_sizes <- c(5,3,2)
# initialize first group
group_3_values <-
combinations(
n = n,
r = group_sizes[1], # size of the first group!
repeats.allowed = T,
v = 1:n
)
group_3_values <- as.data.table(group_3_values)
group_3_values[, row := 1:nrow(group_3_values)]
output <- as.data.table(group_3_values)
# run for loop for the remaining groups
start_time <- Sys.time()
for (group_size in group_sizes[-1]) { # size of the second, etc. groups
output_list <- list()
for (i in 1:nrow(output)) {
remaining_options <-
setdiff(1:n, output[row == i, .SD, .SDcols = !c('row')])
second_group <- combinations(
n = length(remaining_options),
r = group_size,
repeats.allowed = F,
v = remaining_options
)
second_group <- as.data.table(second_group)
second_group[, row := i]
out <-
merge(
output[row == i,],
second_group,
by = "row",
all.x = T,
all.y = T,
allow.cartesian = T
)
out$row <- NULL
output_list[[i]] <- as.data.table(out)
out <- NULL
}
output <- as.data.table(rbindlist(output_list))
output[, row := 1:nrow(output)]
}
end_time <- Sys.time()
duration <- end_time - start_time
print(duration)
Upvotes: 4
Views: 369
Reputation: 7597
This problem is very similar to Create Combinations in R by Groups. The only difference is that the linked question required that each group be the same size. While this is a small change, it has massive implications from an algorithm standpoint.
As @Ric has pointed out, this is going to be difficult with respect to hardware and computation time.
Before we get started, it is important to note that there are actually 1,163,962,800
total results as the formula given by the OP doesn’t account for the order of the groups. As a small example, consider generating 2 groups of 2.
1 -> (1, 2) (3, 4)
2 -> (1, 3) (2, 4)
3 -> (1, 4) (2, 3)
4 -> (2, 3) (1, 4) ## Same as 3, but the groups are permuted
5 -> (2, 4) (1, 3) ## Same as 2, but the groups are permuted
6 -> (3, 4) (1, 2) ## Same as 1, but the groups are permuted
In order to obtain the correct number, we must account for the groups of same size. In our example, we have 2 groups of 3 and 2 groups for 2, so we must divide our original result by 2! * 2! = 4
:
(choose(20, 10) * choose(10, 3) *
choose(7, 3) * choose(4, 2) *choose(2, 2)) /
(factorial(2) * factorial(2))
#> [1] 1163962800
It is great that we have reduced the size of our output, however we still are talking about greater than 1 billion results.
RcppAlgos
I am the author of the package RcppAlgos
and as of version 2.8.0
we were able to handle partitions of groups of any size and as of 2.8.2
we could iterate over them via comboGroupsIter
.
If your machine could handle it, we could pretty quickly generate all of the results with the function comboGroups
using the parameter nThreads
for producing results in parallel. Needless to say, you would need a decent chunk of memory:
library(RcppAlgos)
desired_grps <- c(10, 3, 3, 2, 2)
v <- seq_len(sum(desired_grps))
comboGroupsCount(v, grpSizes = desired_grps)
#> [1] 1163962800
## If you really wanted to generate them all,
## you will need at least 87 GBs free. The formula used is:
## ((# rows) * (# columns) * (bytes per cell)) / (# bytes in Gibabyte)
(comboGroupsCount(v, grpSizes = desired_grps) * 20 * 4) / 2^30
#> [1] 86.72199
Alternatively, we could utilize the arguments lower
and upper
as we did in the answer to Create Combinations in R by Groups to keep memory low, however this can get cumbersome.
A better option is to use iterators to keep memory low. These iterators are quite flexible, offering various methods for traversing the results and they even offer random access via [[
:
## nThreads is optional, however we choose 4 threads for increased efficiency
it <- comboGroupsIter(v, grpSizes = desired_grps, nThreads = 4)
## To see the next result, call the method nextIter
it@nextIter()
#> Grp1 Grp1 Grp2 Grp2 Grp3 Grp3 Grp3 Grp4 Grp4 Grp4 Grp5 Grp5 Grp5 Grp5 Grp5 Grp5
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
#> Grp5 Grp5 Grp5 Grp5
#> 17 18 19 20
## To see the next n results, call nextNIter
it@nextNIter(5)
#> Grp1 Grp1 Grp2 Grp2 Grp3 Grp3 Grp3 Grp4 Grp4 Grp4 Grp5 Grp5 Grp5 Grp5 Grp5
#> [1,] 1 2 3 4 5 6 7 8 9 11 10 12 13 14 15
#> [2,] 1 2 3 4 5 6 7 8 9 12 10 11 13 14 15
#> [3,] 1 2 3 4 5 6 7 8 9 13 10 11 12 14 15
#> [4,] 1 2 3 4 5 6 7 8 9 14 10 11 12 13 15
#> [5,] 1 2 3 4 5 6 7 8 9 15 10 11 12 13 14
#> Grp5 Grp5 Grp5 Grp5 Grp5
#> [1,] 16 17 18 19 20
#> [2,] 16 17 18 19 20
#> [3,] 16 17 18 19 20
#> [4,] 16 17 18 19 20
#> [5,] 16 17 18 19 20
## Note, the state is updated
it@summary()
#> $description
#> [1] "Partition of v of length 20 into 5 groups of sizes: 2, 2, 3, 3, 10"
#>
#> $currentIndex
#> [1] 6
#>
#> $totalResults
#> [1] 1163962800
#>
#> $totalRemaining
#> [1] 1163962794
## Quickly jump to a specific point, m, without generating the m - 1 results before it
it[[1e8]]
#> Grp1 Grp1 Grp2 Grp2 Grp3 Grp3 Grp3 Grp4 Grp4 Grp4 Grp5 Grp5 Grp5 Grp5 Grp5 Grp5
#> 1 10 3 12 5 14 19 6 18 20 2 4 7 8 9 11
#> Grp5 Grp5 Grp5 Grp5
#> 13 15 16 17
## Again, the state is updated
it@summary()
#> $description
#> [1] "Partition of v of length 20 into 5 groups of sizes: 2, 2, 3, 3, 10"
#>
#> $currentIndex
#> [1] 100000000
#>
#> $totalResults
#> [1] 1163962800
#>
#> $totalRemaining
#> [1] 1063962800
## To see the last result
it@back()
#> Grp1 Grp1 Grp2 Grp2 Grp3 Grp3 Grp3 Grp4 Grp4 Grp4 Grp5 Grp5 Grp5 Grp5 Grp5 Grp5
#> 17 20 18 19 11 15 16 12 13 14 1 2 3 4 5 6
#> Grp5 Grp5 Grp5 Grp5
#> 7 8 9 10
The algorithms are very efficient as well and can handle generating all 1,163,962,800
in a reasonable amount of time, especially when we use nThreads
system.time({
## Reset the iterator
it@startOver()
## Generate one million at a time
while (!is.null(it@nextNIter(1e6))) {}
print(it@summary())
})
#> $description
#> [1] "Partition of v of length 20 into 5 groups of sizes: 2, 2, 3, 3, 10"
#>
#> $currentIndex
#> [1] 1163962801
#>
#> $totalResults
#> [1] 1163962800
#>
#> $totalRemaining
#> [1] -1
#> user system elapsed
#> 25.271 3.668 8.771
Not only were we able to iterate over all 1,163,962,800
results in under 10 seconds, we also kept our RAM happy by only requiring round((1e6 * 20 * 4) / 2^20) ~= 76
Megabytes per iteration.
Upvotes: 2
Reputation: 5722
I think materializing a table or even a vector of 4,655,851,200 elements is not a good choice. Perhaps the best you can do is to use combn
to generate each combination, with a callback to run your desired code.
This example generates and prints the elements for every combination and increases a counter n
in the global scope. I use callCC
for early exit at 10 iterations .
x <- 1:20
n<-0
callCC(function(exit)
combn(x, 10, function(i)
combn(setdiff(x, c(i)), 3, function(j)
combn(setdiff(x, c(i,j)), 3, function(k)
combn(setdiff(x, c(i,j,k)), 2, function(l)
combn(setdiff(x, c(i,j,k,l)), 2, function(m){
n<<-n+1
# YOUR CODE HERE
print(paste0(sapply(list(i,j,k,l,m), paste0, collapse=" "), collapse=" | "))
if(n>=10) exit("early exit")
}, simplify=F),
simplify=F),
simplify=F),
simplify=F),
simplify=F))
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 17 18 | 19 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 17 19 | 18 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 17 20 | 18 19"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 18 19 | 17 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 18 20 | 17 19"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 16 | 19 20 | 17 18"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 17 | 16 18 | 19 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 17 | 16 19 | 18 20"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 17 | 16 20 | 18 19"
#> [1] "1 2 3 4 5 6 7 8 9 10 | 11 12 13 | 14 15 17 | 18 19 | 16 20"
#> [1] "early exit"
Upvotes: 4