Reputation: 505
I have a dataframe with many numerical columns. The first column needs to be regressed against the second column and the R-squared value stored, then the first column against the third column and that R-squared value stored, and so on, until the nth column has been regressed against the 1st column.
I'd like the result to be a dataframe that holds the R-squared value for each pair of columns regressed.
i.e.
tested        rsqr
col1 v col2   0.56
col1 v col3   0.28
col1 v col4   0.38
I know I haven't supplied data; I'm looking for the approach here. I had been using an lm() call inside a for loop to do this, but it takes very long. I'm wondering if there's an apply-style solution to this.
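Roughly, the kind of loop I mean (a simplified sketch with placeholder names, not my actual code):

# slow baseline: fit one lm per column against column 1 and collect r-squared
rsq <- numeric(ncol(df) - 1)
for (i in 2:ncol(df)) {
  fit        <- lm(df[, 1] ~ df[, i])
  rsq[i - 1] <- summary(fit)$r.squared
}
result <- data.frame(tested = paste0("col1 v col", 2:ncol(df)), rsqr = rsq)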
paul
Upvotes: 3
Views: 4670
Reputation: 42283
Here's an approach using some typical dplyr/purrr/tidyr/broom idioms.
Load the libraries:
library(dplyr)
library(purrr)
library(tidyr)
library(broom)
Here's the data:
dt = mtcars # already a dataframe
Here's the sequence to compute separate linear regressions for the columns mpg, cyl, and hp against the column disp, and get the r-squared for each regression:
dt %>%
  select(disp, mpg, cyl, hp) %>%
  gather(key = group,
         value = measurement,
         -disp) %>%
  group_by(group) %>%
  nest() %>%
  mutate(model = map(data, ~lm(disp ~ measurement, data = .))) %>%
  unnest(model %>% map(glance))
Here's the output:
Source: local data frame [3 x 14]

  group            data   model r.squared adj.r.squared    sigma
  (chr)           (chr)   (chr)     (dbl)         (dbl)    (dbl)
1   mpg <tbl_df [32,2]> <S3:lm> 0.7183433     0.7089548 66.86320
2   cyl <tbl_df [32,2]> <S3:lm> 0.8136633     0.8074521 54.38465
3    hp <tbl_df [32,2]> <S3:lm> 0.6255997     0.6131197 77.08950
Variables not shown: statistic (dbl), p.value (dbl), df (int), logLik
  (dbl), AIC (dbl), BIC (dbl), deviance (dbl), df.residual (int)
To narrate the sequence in plain English: select the response column disp and the three predictor columns; gather the predictors into long format (one row per predictor name and value); group and nest by predictor so each predictor gets its own mini data frame; fit a separate lm() of disp on each predictor with map(); and finally unnest the glance() summaries so each regression contributes one row, including its r.squared.
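On newer tidyr/broom releases, gather() and that form of unnest() are superseded; a roughly equivalent sketch using pivot_longer() (my adaptation under that assumption, not part of the original answer) is:

library(dplyr)
library(purrr)
library(tidyr)
library(broom)

mtcars %>%
  select(disp, mpg, cyl, hp) %>%
  pivot_longer(-disp, names_to = "group", values_to = "measurement") %>%
  group_by(group) %>%
  nest() %>%
  mutate(model   = map(data, ~ lm(disp ~ measurement, data = .x)),
         glanced = map(model, glance)) %>%
  unnest(glanced) %>%
  select(group, r.squared)

The final select() just trims the output down to the predictor name and its r.squared.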
Upvotes: 2
Reputation: 3274
Taking large pointers from @etienne's solution, here's a data.table answer.
library(data.table)
set.seed(1)
df <- as.data.frame(matrix(rnorm(100),10))
dt = setDT(df)
melt(dt, id.vars = "V1")[!is.na(value) & !is.na(V1),          # rm NAs
                         summary(lm(V1 ~ value))$r.squared,   # lm call
                         variable]                            # for each column
   variable         V1
1:       V2 0.14190543
2:       V3 0.51242469
3:       V4 0.05973700
4:       V5 0.05149017
5:       V6 0.37621382
6:       V7 0.14208468
7:       V8 0.38533983
8:       V9 0.26596917
9:      V10 0.01758616
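A small variation (my addition, not from the answer above) that names the result column and spells out the by= argument:

melt(dt, id.vars = "V1")[!is.na(value) & !is.na(V1),
                         .(r.squared = summary(lm(V1 ~ value))$r.squared),
                         by = variable]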
Upvotes: 0
Reputation: 16121
Here's a dplyr approach. The philosophy is to combine column names into a formula for each regression you want to run.
library(dplyr)
dt = data.frame(mtcars)
# specify columns to regress
y_col = "disp"
x_col = c("mpg","cyl","hp")
expand.grid(y = y_col, x = x_col, stringsAsFactors = F) %>%
  mutate(formula = paste(y, "~", x)) %>%
  group_by(formula) %>%
  mutate(r_sq = summary(lm(formula, data = dt))$r.squared) %>%
  ungroup()
#       y     x    formula      r_sq
#   (chr) (chr)      (chr)     (dbl)
# 1  disp   mpg disp ~ mpg 0.7183433
# 2  disp   cyl disp ~ cyl 0.8136633
# 3  disp    hp  disp ~ hp 0.6255997
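To mirror the original question (first column regressed against every other column), the same formula-building idea generalizes as below; this is my adaptation, taking the first column of dt as the response:

y_col = names(dt)[1]               # first column as the response
x_col = setdiff(names(dt), y_col)  # every remaining column as a predictor

expand.grid(y = y_col, x = x_col, stringsAsFactors = F) %>%
  mutate(formula = paste(y, "~", x)) %>%
  group_by(formula) %>%
  mutate(r_sq = summary(lm(formula, data = dt))$r.squared) %>%
  ungroup()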
Upvotes: 2
Reputation: 132706
If you only want R², you don't need to fit linear models; you can simply calculate Pearson's correlation coefficient and square it (for simple linear regression, the squared correlation equals R²). This will give you the squared correlation between all pairs of columns:
cor(yourDataFrame)^2
And this is an example of the squared correlations with the first column:
set.seed(42)
df<-as.data.frame(matrix(rnorm(100), ncol = 4))
cor(df, df[,1])^2
#           [,1]
# V1 1.000000000
# V2 0.006508638
# V3 0.110714099
# V4 0.006231468
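To package that into the kind of data frame the question asks for (my addition; the labels assume the response is the first column):

r2 <- cor(df, df[, 1])^2
data.frame(tested = paste(names(df)[1], "v", names(df)),
           rsqr   = as.vector(r2))[-1, ]   # drop the trivial V1 v V1 row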
Upvotes: 2
Reputation: 3678
Try
set.seed(1)
df <- as.data.frame(matrix(rnorm(100), 10))   # reproducible data
column1 <- paste0('col1 vs col', 2:10)        # labels: column 1 vs each other column
column2 <- sapply(2:10, function(x) summary(lm(df[, 1] ~ df[, x]))$r.squared)  # the r-squared values
final <- data.frame('reg' = column1, 'rsquared' = column2)  # the final data.frame
final
            reg   rsquared
1  col1 vs col2 0.14190543
2  col1 vs col3 0.51242469
3  col1 vs col4 0.05973700
4  col1 vs col5 0.05149017
5  col1 vs col6 0.37621382
6  col1 vs col7 0.14208468
7  col1 vs col8 0.38533983
8  col1 vs col9 0.26596917
9 col1 vs col10 0.01758616
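If the number of columns isn't fixed at 10, the same idea generalizes (my tweak, still taking column 1 as the response):

n <- ncol(df)
rsq <- vapply(2:n, function(x) summary(lm(df[, 1] ~ df[, x]))$r.squared, numeric(1))
data.frame(reg = paste0('col1 vs col', 2:n), rsquared = rsq)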
Upvotes: 0