Reputation: 1117
I have a data frame with hundreds of rows and tens of columns, and I need to find the combinations of rows whose sum falls within a given range. I searched for permutations and computations, and for the subset sum problem, but none of the solutions I found seem to produce the desired result.
Please advise whether a package exists that solves this kind of problem. Is there a function or a vectorised way to handle these kinds of "iterations" in R?
# sample dataframe
x <- data.frame(a=c("A","B","C","D"),b=c(1,2,1,1))
Let's assume the sum of any accepted combination must be 3; the desired result could then be a list like
[[1]] # combination 1
[1]
1,2 # lists all rows used
[2]
1,2 # lists all values used
[[2]] # combination 2
[1]
2,3 # lists all rows used
[2]
2,1 # lists all values used
[[3]] # combination 3
[1]
2,4 # lists all rows used
[2]
2,1 # lists all values used
[[4]] # combination 4
[1]
1,3,4 # lists all rows used
[2]
1,1,1 # lists all values used
(The # comments are only added for explanatory purposes.)
Note:
Not all possible combinations are required, and not all values have to be used!
A row must only be used ONCE in a given combination (i.e. summing row one three times is not an option!)
Combinations could be sum(x[1:2,2])
as well as (x[1,2] + x[2,2] + .... + x[n,2])
Upvotes: 0
Views: 965
Reputation: 10761
I hope I understand your question correctly. Supposing we have some matrix dat, we wish to sum up (for each column) the different combinations of rows. We can accomplish this using the *apply family of functions, along with combn.
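As a quick illustration of the building block used here (a minimal sketch on made-up indices, not part of the original data): combn returns a matrix in which each column is one combination, which is why the code further down loops over columns.

```r
# All ways to pick 2 row indices out of 4.
# Each COLUMN of the result is one combination.
combs <- combn(1:4, 2)
combs

# The number of columns equals choose(4, 2)
ncol(combs) == choose(4, 2)
```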
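Here's what we do:
- Loop over the columns of dat (apply)
- Loop through the unique combinations of rows in dat (lapply and apply)
- Generate those combinations with the combn function within an sapply call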
set.seed(123)
dat <- matrix(rnorm(5 * 6), nrow = 5, ncol = 6)
dat
big_list <- apply(dat, 2, FUN = function(matcol) # over the columns of dat
  lapply(sapply(1:5, FUN = function(x) combn(1:5, x)), # loop through unique combinations of rows in dat
         FUN = function(combs)
           apply(combs, 2, # over the columns of unique combinations
                 FUN = function(rows)
                   data.frame(
                     'rows_used' = paste(rows, collapse = ', '),
                     'n_rows' = length(rows),
                     'sum' = sum(matcol[rows]))))) # sum up the rows
[[1]] # column
[[1]][[1]] #[[n_rows]][[n_comb]]
rows_used n_rows sum
1 1 1 -0.5604756
[[1]][[2]]
rows_used n_rows sum
1 2 1 -0.2301775
[[1]][[3]]
rows_used n_rows sum
1 3 1 1.558708
After we've gone through each column and each combination of rows, we can extract the data from the list into a data.frame. Suppose, for example, we were interested in the sums from column 6:
column <- 6
df_from_list <- do.call('rbind',
                        lapply(big_list[[column]],
                               FUN = function(x) do.call('rbind', x)))
rows_used n_rows sum
1 1 1 -1.6866933
2 2 1 0.8377870
3 3 1 0.1533731
4 4 1 -1.1381369
5 5 1 1.2538149
Then, we can use the subset function (or dplyr::filter) to get all combinations of n rows in column 6 where the sum is >= 0 and <= 0.5:
subset(df_from_list, sum >= 0 & sum <= .5)
rows_used n_rows sum
3 3 1 0.1533731
15 4, 5 2 0.1156780
18 1, 2, 5 3 0.4049087
25 3, 4, 5 3 0.2690511
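The same combn idea can be applied directly to the sample data frame from the question. The following is only a sketch under assumptions: find_row_combos and its arguments lower and upper are hypothetical names made up for this illustration (they do not come from any package), and the brute-force enumeration is only practical for a small number of rows.

```r
# Sample data from the question
x <- data.frame(a = c("A", "B", "C", "D"), b = c(1, 2, 1, 1))

# Hypothetical helper: list every combination of rows whose values sum to
# something within [lower, upper]. Each row is used at most once per combination.
find_row_combos <- function(values, lower, upper = lower) {
  n <- length(values)
  hits <- list()
  for (k in seq_len(n)) {
    combs <- combn(n, k)                               # k x choose(n, k) matrix of row indices
    sums  <- colSums(matrix(values[combs], nrow = k))  # one sum per combination
    for (j in which(sums >= lower & sums <= upper)) {
      rows <- combs[, j]
      hits[[length(hits) + 1]] <- list(rows = rows, values = values[rows])
    }
  }
  hits
}

# All combinations of rows in column b that sum to exactly 3
res <- find_row_combos(x$b, lower = 3)
str(res)
```

Each element of res holds the rows used and the corresponding values, which mirrors the list layout sketched in the question. Passing different lower and upper values turns the exact match into a range check.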
It wouldn't surprise me if this calculation method doesn't scale well, since the number of row combinations grows as 2^n - 1 for n rows, and I'm sure there is a more efficient solution. The way I've solved the problem results in a nested list structure, so a user should be familiar with the list object in R.
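To get a feel for the scaling concern, the number of non-empty row combinations can be counted without enumerating them. This is a small sketch; n_subsets is a hypothetical helper name introduced only for this illustration.

```r
# Number of non-empty subsets of n rows: sum over k of choose(n, k), i.e. 2^n - 1
n_subsets <- function(n) sum(choose(n, seq_len(n)))

n_subsets(5)    # the 5-row example above: 5 + 10 + 10 + 5 + 1 combinations
n_subsets(20)
n_subsets(100)  # far too many to enumerate
```

With hundreds of rows, as in the question, exhaustive enumeration is not feasible, so the approach above is best suited to small inputs or to a restricted combination size.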
Upvotes: 2