syntonicC

Reputation: 370

Subsetting a data frame in R based on a specific pattern of columns

I am computing the r-squared for multiple pairs of columns in a data frame. I could write out the code for each pair individually, but I want to automate this with apply or another vectorized approach, based on the pattern of columns I am selecting from the data frame.

Sample data:

set.seed(1234)
dat <- data.frame(replicate(18,rnorm(10)))

To get the r-squared for column 1 v. 2:

fit <- lm(dat[,1] ~ dat[,2])
summary(fit)$r.squared

But I would like to do all of the following combinations: {1, 2}, {2, 3}, {3, 1}, {4, 5}, {5, 6}, {6, 4}... etc. through the 18th column.

In other words, all pairwise combinations within a group of three columns, with the window moving to the next group of three each time. This way I can call the function once on the whole data frame and get all the r-squared values at once instead of repeating the code 18 times.

Upvotes: 1

Views: 142

Answers (5)

RHertel

Reputation: 23788

You can try this:

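v1 <- seq_len(ncol(dat))     # left-hand column index: 1, 2, ..., 18
v2 <- v1 + c(1L, 1L, -2L)    # partner column; the offsets recycle over each block of three
m <- cbind(v1, v2)           # one row per pair: {1,2}, {2,3}, {3,1}, {4,5}, ...
fit <- lapply(seq_len(nrow(m)), function(x) lm(dat[, m[x, 1]] ~ dat[, m[x, 2]]))
rsq <- sapply(fit, function(f) summary(f)$r.squared)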

Upvotes: 1

AntoniosK

Reputation: 16121

An alternative approach using the dplyr package:

set.seed(1234)
dat <- data.frame(replicate(18,rnorm(10)))

library(dplyr)


data.frame(colnames = names(dat)) %>%        # get the names of columns
  mutate(group = cumsum(ifelse(row_number() %in% seq(1,ncol(dat),3),1,0))) %>%  # create group id based on 3 consecutive columns
  group_by(group) %>%                        # for each group id
  do({cb = combn(.$colnames,2)               # create combinations of column names
      data.frame(col1 = cb[1,],
                 col2 = cb[2,])}) %>%
  mutate(formula = paste(col1,"~",col2)) %>% # create a formula for each combination
  rowwise() %>%                              # for each row/formula
  do(data.frame(formula = .$formula,
                r.sq = summary(lm(.$formula, data=dat))$r.squared)) # create model and get r squared


#      formula         r.sq
#        (chr)        (dbl)
# 1    X1 ~ X2 3.072421e-02
# 2    X1 ~ X3 3.056746e-01
# 3    X2 ~ X3 7.708176e-02
# 4    X4 ~ X5 7.293980e-01
# 5    X4 ~ X6 3.244157e-01
# 6    X5 ~ X6 2.231886e-01
# 7    X7 ~ X8 6.637355e-03
# 8    X7 ~ X9 1.497414e-06
# 9    X8 ~ X9 9.758725e-02
# 10 X10 ~ X11 2.728225e-01
# 11 X10 ~ X12 5.973809e-02
# 12 X11 ~ X12 1.196112e-01
# 13 X13 ~ X14 5.541950e-02
# 14 X13 ~ X15 3.488573e-02
# 15 X14 ~ X15 2.519877e-02
# 16 X16 ~ X17 7.004510e-04
# 17 X16 ~ X18 8.827935e-02
# 18 X17 ~ X18 1.112862e-01

If you prefer, you can replace mutate(group = cumsum(ifelse(row_number() %in% seq(1,ncol(dat),3),1,0))) (which starts a new group at every third column) with mutate(group = ntile(row_number(),6)) (which creates 6 groups of 3 consecutive columns).
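The group ids that the cumsum expression produces can be checked in base R without dplyr. This is only a small sketch; n_cols, group_cumsum and group_block are illustrative names, and the integer-division variant is an alternative that is not part of the answer above.

```r
n_cols <- 18  # number of columns in the sample data

# Start a new group at columns 1, 4, 7, ... and accumulate the group id
group_cumsum <- cumsum(seq_len(n_cols) %in% seq(1, n_cols, 3))

# Alternative (assumption, not from the answer): integer division by the block size
group_block <- (seq_len(n_cols) - 1) %/% 3 + 1

# Each group id appears three times, one per column in the block
table(group_cumsum)
all.equal(as.numeric(group_cumsum), group_block)
```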

Upvotes: 0

user295691

Reputation: 7248

If you just need the r-squared values, you can use the cor function to get the correlation matrix. For a simple regression with a single predictor and an intercept, r-squared is the square of the corresponding entry in that matrix.
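A minimal sketch of this idea, assuming the sample data from the question and a number of columns that is a multiple of three. The names r2, lhs, rhs and rsq are illustrative, and the index offsets are borrowed from the windowing pattern in the question.

```r
set.seed(1234)
dat <- data.frame(replicate(18, rnorm(10)))

# Squared Pearson correlations for every pair of columns (18 x 18 matrix)
r2 <- cor(dat)^2

# Index pairs for the moving window: {1,2}, {2,3}, {3,1}, {4,5}, ...
lhs <- seq_len(ncol(dat))
rhs <- lhs + c(1L, 1L, -2L)   # offsets recycle over each block of three columns

# Pick the windowed pairs out of the matrix with two-column matrix indexing
rsq <- r2[cbind(lhs, rhs)]
names(rsq) <- paste(names(dat)[lhs], "~", names(dat)[rhs])

# Cross-check the first pair against lm()
all.equal(unname(rsq[1]), summary(lm(X1 ~ X2, data = dat))$r.squared)
```

This avoids fitting 18 separate models, at the cost of computing correlations for pairs that are never used. For 18 columns that overhead is negligible.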

Upvotes: 0

DH4wes

Reputation: 71

Or in one line:

results <- sapply(1:ncol(dat), function(x) summary( lm( dat[ , x ] ~ dat[ ,ifelse( x%%3 != 0, x+1, x-2)]) )$r.squared )

Upvotes: 3

Abderyt

Reputation: 109

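This should work, but note that it fits every pair of columns (153 pairs for 18 columns), not only the pairs within each group of three: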

results <- apply(combn(colnames(dat), 2), 2, function(x)summary(lm(dat[, x[1]] ~ dat[, x[2]]))$r.squared)
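To keep only the pairs inside each block of three columns, combn can be applied per block instead of to all column names at once. A minimal sketch, assuming the sample data from the question and a number of columns that is a multiple of three (groups and pairs are illustrative names):

```r
set.seed(1234)
dat <- data.frame(replicate(18, rnorm(10)))

# Split the column names into consecutive blocks of three
groups <- split(names(dat), rep(seq_len(ncol(dat) / 3), each = 3))

# All pairs within each block, bound into one 2 x 18 character matrix
pairs <- do.call(cbind, lapply(groups, combn, m = 2))

# Fit one model per pair (one column of `pairs`) and keep the r-squared
results <- apply(pairs, 2, function(x) {
  summary(lm(dat[[x[1]]] ~ dat[[x[2]]]))$r.squared
})
```

combn returns the pairs as {1, 2}, {1, 3}, {2, 3} rather than {1, 2}, {2, 3}, {3, 1}. Because r-squared in a one-predictor regression does not depend on which column is the response, the values match up to ordering.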

Upvotes: 0
