Umi

Reputation: 7

How to make all combinations of columns

I have a data frame with 46 variables, and I would like to create a subset for every possible combination of 2 variables.

For example, if I had a data frame with 3 variables "A", "B" and "C", my goal would be 3 subsets: A and B, A and C, B and C.

I would like to use each of those subsets as the covariates of a regression model, so that I can try every combination of 2 variables as covariates.

All I can think of is using a loop, but I would appreciate it if anyone could show me how to do it!

Upvotes: 0

Views: 528

Answers (2)

sorifiend

Reputation: 6307

Following on from the comments, you can do this with nested loops.

The code below loops over the vector and prints every pair exactly once, with no duplicates:

# your data
char_vec <- c("A", "B", "C", "D")

# index for the outer loop
i <- 1

# stop at length - 1 because the last value alone cannot form a pair
while (i <= length(char_vec) - 1) {

    # index for the inner loop
    # start at i + 1 so that no pair is repeated
    j <- i + 1
    while (j <= length(char_vec)) {
        # print the pair, or do whatever you need with it
        # sep = "" joins the two values without a space
        print(paste(char_vec[i], char_vec[j], sep = ""))

        # move on to the next inner value
        j <- j + 1
    }
    # move on to the next outer value
    i <- i + 1
}

And the output looks like this:

[1] "AB"
[1] "AC"
[1] "AD"
[1] "BC"
[1] "BD"
[1] "CD"
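
The same index pattern can collect the two-column subsets themselves instead of printing labels. This is only a minimal sketch: the data frame `df` and the `"A_B"` style names are made up for illustration, so swap in your own data.

```r
# Hypothetical example data; replace with your own data frame
df <- data.frame(A = 1:3, B = 4:6, C = 7:9)
vars <- names(df)

subsets <- list()
for (i in seq_len(length(vars) - 1)) {
  for (j in (i + 1):length(vars)) {
    pair <- c(vars[i], vars[j])
    # store the two-column subset under a name such as "A_B"
    subsets[[paste(pair, collapse = "_")]] <- df[, pair]
  }
}

# one list element per pair of columns
names(subsets)
```

Each element of `subsets` is a data frame holding just the two columns of that pair.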

Upvotes: 0

Waldi

Reputation: 41240

combn can help prepare the list of combinations:

apply(combn(c("A","B","C"),2),2,function(x) as.formula(paste0("y~",x[1],'+',x[2])))

[[1]]
y ~ A + B
<environment: 0x0000027286e851c8>

[[2]]
y ~ A + C
<environment: 0x000002728897a380>

[[3]]
y ~ B + C
<environment: 0x000002728692adc0>
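
As a possible variant (a sketch, not the only way to do it), `combn(..., simplify = FALSE)` returns the pairs as a list, and base R's `reformulate()` builds each formula without pasting strings. The response name `y` is the same placeholder as above. With 46 variables this would produce `choose(46, 2)`, i.e. 1035, formulas.

```r
vars <- c("A", "B", "C")

# simplify = FALSE returns a list with one character vector per pair
pairs <- combn(vars, 2, simplify = FALSE)

# reformulate() builds response ~ term1 + term2 from the term labels
formulas <- lapply(pairs, reformulate, response = "y")

length(formulas)  # number of pairs for 3 variables
choose(46, 2)     # number of pairs for 46 variables
```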

You could then use lapply to evaluate the different formulas.

For example, with mtcars (10 candidate covariates give 45 models; only the first three summaries are shown here):

variables <- setdiff(colnames(mtcars),"cyl")
cbn <- apply(combn(variables,2),2,function(x) as.formula(paste0("cyl~",x[1],'+',x[2])))
lapply(cbn,function(x) {summary(eval(substitute(lm(y,mtcars),list(y=x))))})
#> [[1]]
#> 
#> Call:
#> lm(formula = cyl ~ mpg + disp, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.3002 -0.6138  0.1776  0.5486  1.1406 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  5.917863   1.255293   4.714 5.61e-05 ***
#> mpg         -0.092206   0.041352  -2.230   0.0337 *  
#> disp         0.009198   0.002011   4.574 8.27e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.7364 on 29 degrees of freedom
#> Multiple R-squared:  0.8409, Adjusted R-squared:   0.83 
#> F-statistic: 76.66 on 2 and 29 DF,  p-value: 2.647e-12
#> 
#> 
#> [[2]]
#> 
#> Call:
#> lm(formula = cyl ~ mpg + hp, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.5641 -0.4721 -0.1099  0.6273  1.3585 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  7.629183   1.226285   6.221 8.69e-07 ***
#> mpg         -0.153574   0.039052  -3.933  0.00048 ***
#> hp           0.011205   0.003433   3.264  0.00281 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8263 on 29 degrees of freedom
#> Multiple R-squared:  0.7998, Adjusted R-squared:  0.7859 
#> F-statistic: 57.91 on 2 and 29 DF,  p-value: 7.459e-11
#> 
#> 
#> [[3]]
#> 
#> Call:
#> lm(formula = cyl ~ mpg + drat, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.8180 -0.4772  0.2271  0.6694  1.3862 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 13.03441    1.15565  11.279 4.02e-12 ***
#> mpg         -0.20753    0.03737  -5.554 5.45e-06 ***
#> drat        -0.74449    0.42121  -1.767   0.0877 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.918 on 29 degrees of freedom
#> Multiple R-squared:  0.7528, Adjusted R-squared:  0.7358 
#> F-statistic: 44.16 on 2 and 29 DF,  p-value: 1.581e-09
#> 
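
Printing every summary gets unwieldy with many pairs, so it may be easier to keep one number per model and compare those. The sketch below assumes adjusted R-squared is the criterion you care about; the `results` data frame and its column names are made up for illustration.

```r
variables <- setdiff(colnames(mtcars), "cyl")
pairs <- combn(variables, 2, simplify = FALSE)

# Fit one model per pair and keep only the adjusted R-squared
adj_r2 <- sapply(pairs, function(p) {
  fit <- lm(reformulate(p, response = "cyl"), data = mtcars)
  summary(fit)$adj.r.squared
})

# One row per pair of covariates
results <- data.frame(
  var1 = sapply(pairs, `[`, 1),
  var2 = sapply(pairs, `[`, 2),
  adj_r2 = adj_r2
)

# Best-fitting pairs first
head(results[order(-results$adj_r2), ])
```

Any other statistic from `summary(fit)` or `AIC(fit)` could be collected the same way.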

Upvotes: 1
