Reputation:
I have a data set of about 500 stocks with their returns. I was wondering if there was a way to generate multiple regression outputs, a 1 to 1 relationship, for a dependent variable to one independent?
For example, here is a simple regression output for one of the stocks MSFT compared to the market. Instead of creating multiple lines like the ones I have below, is there a way to put each of the 500 variables into the Y component of the formula and generate a new output for each? Maybe even go as far as put all this data into one table?
These are the variables that I am trying to group to ease my process:
regression_model <- lm(raw_data$MSFT~raw_data$VFINX, raw_data)
summary(regression_model)
Upvotes: 1
Views: 657
Reputation: 33772
One approach to this (which is sure to be a duplicate answer somewhere?) is:
Your example data is not usable, so let's use mtcars
. Assume we want to predict mpg
using each of disp
, hp
, drat
, wt
and qsec
. First we select, gather and nest:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
mtcars %>%
select(mpg, disp, hp, drat, wt, qsec) %>%
gather(Var, Val, -mpg) %>%
nest(data = c(mpg, Val))
Result:
# A tibble: 5 x 2
Var data
<chr> <list>
1 disp <tibble [32 x 2]>
2 hp <tibble [32 x 2]>
3 drat <tibble [32 x 2]>
4 wt <tibble [32 x 2]>
5 qsec <tibble [32 x 2]>
Now we can map
each row to a regression, and create the column of tidied output:
mtcars %>%
select(mpg, disp, hp, drat, wt, qsec) %>%
gather(Var, Val, -mpg) %>%
nest(data = c(mpg, Val)) %>%
mutate(model = map(data, ~lm(mpg ~ Val, data = .)),
tidied = map(model, tidy))
# A tibble: 5 x 4
Var data model tidied
<chr> <list> <list> <list>
1 disp <tibble [32 x 2]> <lm> <tibble [2 x 5]>
2 hp <tibble [32 x 2]> <lm> <tibble [2 x 5]>
3 drat <tibble [32 x 2]> <lm> <tibble [2 x 5]>
4 wt <tibble [32 x 2]> <lm> <tibble [2 x 5]>
5 qsec <tibble [32 x 2]> <lm> <tibble [2 x 5]>
And finally, select the columns we want and unnest
:
mtcars %>%
select(mpg, disp, hp, drat, wt, qsec) %>%
gather(Var, Val, -mpg) %>%
nest(data = c(mpg, Val)) %>%
mutate(model = map(data, ~lm(mpg ~ Val, data = .)),
tidied = map(model, tidy)) %>%
select(-model, -data) %>%
unnest(cols = c(tidied))
Result:
# A tibble: 10 x 6
Var term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 disp (Intercept) 29.6 1.23 24.1 3.58e-21
2 disp Val -0.0412 0.00471 -8.75 9.38e-10
3 hp (Intercept) 30.1 1.63 18.4 6.64e-18
4 hp Val -0.0682 0.0101 -6.74 1.79e- 7
5 drat (Intercept) -7.52 5.48 -1.37 1.80e- 1
6 drat Val 7.68 1.51 5.10 1.78e- 5
7 wt (Intercept) 37.3 1.88 19.9 8.24e-19
8 wt Val -5.34 0.559 -9.56 1.29e-10
9 qsec (Intercept) -5.11 10.0 -0.510 6.14e- 1
10 qsec Val 1.41 0.559 2.53 1.71e- 2
You could add filters from dplyr::filter()
, for example to remove the Intercept rows, or select a p-value threshold.
mtcars %>%
select(mpg, disp, hp, drat, wt, qsec) %>%
gather(Var, Val, -mpg) %>%
nest(data = c(mpg, Val)) %>%
mutate(model = map(data, ~lm(mpg ~ Val, data = .)),
tidied = map(model, tidy)) %>%
select(-model, -data) %>%
unnest(cols = c(tidied)) %>%
filter(p.value < 0.01,
term != "(Intercept)")
# A tibble: 5 x 6
Var term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 disp Val -0.0412 0.00471 -8.75 9.38e-10
2 hp Val -0.0682 0.0101 -6.74 1.79e- 7
3 drat Val 7.68 1.51 5.10 1.78e- 5
4 wt Val -5.34 0.559 -9.56 1.29e-10
Upvotes: 1