Reputation: 505

Lm across many columns in a dataframe in R

I have a dataframe with many numeric columns. The first column needs to be regressed against the second column and the R-squared value stored, then the first column against the third column and its R-squared value stored, and so on, until every column up to the nth has been regressed against the first.

I'd like the result to be a dataframe that holds the R-squared value for each pair of columns regressed:

tested              rsqr
col1 v col2         0.56
col1 v col3         0.28
col1 v col4         0.38

I know I haven't supplied data; I'm looking for the approach. I had been calling lm inside a for loop to do this, but it takes a very long time. I'm wondering if there's an apply solution to this.

paul

Upvotes: 3

Answers (5)

Ben

Reputation: 42283

Here's an approach using some typical dplyr/purrr/tidyr/broom idioms:

Load the libraries:

library(dplyr)
library(purrr)
library(tidyr)
library(broom)

Here's the data:

dt = mtcars # already a dataframe

Here's the sequence to compute separate linear regressions for the columns mpg, cyl, and hp against the column disp, and get the r-squared for each regression:

dt %>% 
  select(disp, mpg, cyl, hp) %>% 
  gather(key = group, 
         value = measurement,
         -disp) %>% 
  group_by(group) %>% 
  nest() %>%
  mutate(model = map(data, ~lm(disp ~ measurement, data = .))) %>% 
  unnest(model %>% map(glance))

Here's the output:

Source: local data frame [3 x 14]

  group            data   model r.squared adj.r.squared    sigma
  (chr)           (chr)   (chr)     (dbl)         (dbl)    (dbl)
1   mpg <tbl_df [32,2]> <S3:lm> 0.7183433     0.7089548 66.86320
2   cyl <tbl_df [32,2]> <S3:lm> 0.8136633     0.8074521 54.38465
3    hp <tbl_df [32,2]> <S3:lm> 0.6255997     0.6131197 77.08950
Variables not shown: statistic (dbl), p.value (dbl), df (int), logLik
  (dbl), AIC (dbl), BIC (dbl), deviance (dbl), df.residual (int)

To narrate the sequence in plain English:

we take the dataframe, then
convert from wide to long format to make a grouping column, then
make a nested dataframe with one row per group, then
compute a linear model for each group, then
extract the output of the models, including the r-squareds, into a dataframe
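
In more recent tidyr releases gather() is superseded by pivot_longer(), and the unnest(model %>% map(glance)) idiom may no longer run as written. Here is a minimal sketch of the same sequence in the newer style, assuming current versions of dplyr, tidyr, purrr and broom are installed. The names r2_tbl and stats are only illustrative, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

r2_tbl <- mtcars %>%
  select(disp, mpg, cyl, hp) %>%
  # wide to long: one row per (disp, predictor value) pair
  pivot_longer(-disp, names_to = "group", values_to = "measurement") %>%
  # one row per predictor, with its data in a list-column
  nest(data = c(disp, measurement)) %>%
  # fit one model per row, then summarise each model with glance()
  mutate(model = map(data, ~ lm(disp ~ measurement, data = .x)),
         stats = map(model, glance)) %>%
  unnest(stats) %>%
  select(group, r.squared)

r2_tbl
```

The final select() keeps only the predictor name and its r-squared; drop it if you also want the other glance() columns.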

Upvotes: 2

Akhil Nair

Reputation: 3274

Borrowing heavily from @etienne's solution, here is a data.table answer.

library(data.table)

set.seed(1)
df <- as.data.frame(matrix(rnorm(100),10))
dt = setDT(df)
melt(dt, id.vars = "V1")[!is.na(value) & !is.na(V1),  # rm NAs
                         summary(lm(V1~value))$r.squared,  # lm call
                         variable]  # for each column

   variable         V1
1:       V2 0.14190543
2:       V3 0.51242469
3:       V4 0.05973700
4:       V5 0.05149017
5:       V6 0.37621382
6:       V7 0.14208468
7:       V8 0.38533983
8:       V9 0.26596917
9:      V10 0.01758616

Upvotes: 0

AntoniosK

Reputation: 16121

Here's a dplyr approach. The idea is to combine column names to create a formula for each regression you want to run.

library(dplyr)

dt = data.frame(mtcars)

# specify columns to regress
y_col = "disp"
x_col = c("mpg","cyl","hp")

expand.grid(y=y_col, x=x_col, stringsAsFactors = FALSE) %>%
  mutate(formula = paste(y,"~",x)) %>%
  group_by(formula) %>%
  mutate(r_sq = summary(lm(formula, data=dt))$r.squared) %>%
  ungroup()


#       y     x    formula      r_sq
#   (chr) (chr)      (chr)     (dbl)
# 1  disp   mpg disp ~ mpg 0.7183433
# 2  disp   cyl disp ~ cyl 0.8136633
# 3  disp    hp  disp ~ hp 0.6255997
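
The same formula-building idea can be sketched in base R with reformulate() from the stats package, which assembles a formula from character strings. This is only an illustration under the same assumptions as above (mtcars, disp as the response); the names formulas, r_sq and result are hypothetical.

```r
dt <- mtcars

# columns to regress (same choice as above)
y_col <- "disp"
x_col <- c("mpg", "cyl", "hp")

# one formula per predictor, e.g. disp ~ mpg
formulas <- lapply(x_col, reformulate, response = y_col)

# fit each model and keep only its r-squared
r_sq <- vapply(formulas,
               function(f) summary(lm(f, data = dt))$r.squared,
               numeric(1))

result <- data.frame(
  formula = vapply(formulas,
                   function(f) paste(deparse(f), collapse = ""),
                   character(1)),
  r_sq = r_sq
)

result
```

vapply() is used instead of sapply() so that the result is guaranteed to be a numeric vector with one value per formula.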

Upvotes: 2

Roland

Reputation: 132706

If you only want R², you don't need to fit linear models at all: for a regression with a single predictor and an intercept, R² is the square of Pearson's correlation coefficient. This gives you the squared correlation between all pairs of columns:

cor(yourDataFrame)^2

And this is an example for correlations with the first column:

set.seed(42)
df<-as.data.frame(matrix(rnorm(100), ncol = 4)) 
cor(df, df[,1])^2
#          [,1]
#V1 1.000000000
#V2 0.006508638
#V3 0.110714099
#V4 0.006231468
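
If you want to convince yourself that the two routes agree, here is a small check on the same simulated data. It compares the squared correlations with the r-squared values from explicit lm fits; r2_cor and r2_lm are just illustrative names.

```r
set.seed(42)
df <- as.data.frame(matrix(rnorm(100), ncol = 4))

# squared Pearson correlation of column 1 with each other column
r2_cor <- drop(cor(df[-1], df[[1]])^2)

# the same quantity from one simple regression per column
r2_lm <- sapply(df[-1], function(x) summary(lm(df[[1]] ~ x))$r.squared)

# compare the two up to floating-point tolerance
all.equal(r2_cor, r2_lm)
```

The shortcut only holds for one predictor plus an intercept; with several predictors per model you would need lm again.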

Upvotes: 2

etienne

Reputation: 3678

Try

set.seed(1)
df <- as.data.frame(matrix(rnorm(100), 10)) # reproducible data
column1 <- paste0('col1 vs col', 2:10) # first column: the regression
column2 <- sapply(2:10, function(x) summary(lm(df[, 1] ~ df[, x]))$r.squared) # the rsquared
final <- data.frame(reg = column1, rsquared = column2) # the final data.frame

        final
            reg   rsquared
1  col1 vs col2 0.14190543
2  col1 vs col3 0.51242469
3  col1 vs col4 0.05973700
4  col1 vs col5 0.05149017
5  col1 vs col6 0.37621382
6  col1 vs col7 0.14208468
7  col1 vs col8 0.38533983
8  col1 vs col9 0.26596917
9 col1 vs col10 0.01758616
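
A sketch of the same idea that does not hard-code 2:10 and returns the column layout asked for in the question. The helper name r2_vs_first is hypothetical, and it assumes every column is numeric with no missing values.

```r
set.seed(1)
df <- as.data.frame(matrix(rnorm(100), 10))  # reproducible data

# Hypothetical helper: r-squared of the first column regressed
# on each of the remaining columns, one row per regression.
r2_vs_first <- function(d) {
  others <- names(d)[-1]
  r2 <- vapply(others,
               function(nm) summary(lm(d[[1]] ~ d[[nm]]))$r.squared,
               numeric(1))
  data.frame(tested = paste(names(d)[1], "v", others),
             rsqr = unname(r2))
}

result <- r2_vs_first(df)
result
```

Because it works from names(d), the same call handles a dataframe with any number of columns.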

Upvotes: 0
