Reputation: 13
I have a dataset structured as such: enter image description here
I would like to run linear regression models and ANOVA using V1, V2...etc. as the independent variables and the g column as the dependent variable in each case (i.e. lm(V1 ~ g), lm(V2 ~ g), and so forth). This would be straightforward except that these linear regressions need to be grouped by level in the pair column, such that, for example, my output contains lm(V1 ~ g) for all rows with pair 1.1 and lm(V1 ~ g) for all pairs 1.201, etc.
I've tried a number of approaches using for loops, lapply and the data.table package, and nothing gives me exactly the output I'd like. Can anyone give me any insight on the best way to tackle this problem?
Edit: My full data set has 7056 different pairs in the pair column and 100 V columns (V1...V100). My latest attempt at this problem:
df$pair <- as.factor(df$pair)
out <- list()
for (i in 3:ncol(df)){
out[[i]] <- lapply(levels(df$pair), function(x) {
data.frame(df=x, g = coef(summary(lm(df[,i]~ df$g, data=df[df$pair==x,])),row.names=NULL))})
}
Upvotes: 1
Views: 7209
Reputation: 5689
Let's get some tidyverse
power here, along with broom
, and forgo all these loops...
First I'll make a dummy table:
df <- data.frame(
g = runif(50),
pair = sample(x = c("A", "B", "C"), size = 50, replace = TRUE),
V1 = runif(50),
V2 = runif(50),
V3 = runif(50),
V4 = runif(50),
V5 = runif(50),
stringsAsFactors = FALSE
)
That's approximately what your data structure looks like. Now onto the meat of the code:
library(tidyverse)
library(broom)
df %>%
as_tibble %>%
gather(key = "column", value = "value", V1:V5) %>% # first set the data in long format
nest(g, value) %>% # now nest the dependent and independent factors
mutate(model = map(data, ~lm(g ~ value, data = .))) %>% # fit the model using purrr
mutate(tidy_model = map(model, tidy)) %>% # clean the model output with broom
select(-data, -model) %>% # remove the "untidy" parts
unnest() # get it back in a recognizable data frame
Which gives us the following:
# A tibble: 30 x 7
pair column term estimate std.error statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 C V1 (Intercept) 0.470 0.142 3.31 0.00561
2 C V1 value 0.125 0.265 0.472 0.645
3 B V1 (Intercept) 0.489 0.142 3.45 0.00359
4 B V1 value -0.0438 0.289 -0.151 0.882
5 A V1 (Intercept) 0.515 0.111 4.63 0.000279
6 A V1 value -0.00569 0.249 -0.0229 0.982
7 C V2 (Intercept) 0.367 0.147 2.50 0.0265
8 C V2 value 0.377 0.300 1.26 0.231
9 B V2 (Intercept) 0.462 0.179 2.59 0.0206
10 B V2 value 0.0175 0.322 0.0545 0.957
# … with 20 more rows
yep, that's a beaut! Note that I used lm(g ~ value)
instead of lm(value ~ g)
as this is what your text description alluded to.
Upvotes: 5
Reputation: 177
Using the tidyverse
package to filter your dataframe:
library(tidyverse)
lm(V1~g, data=filter(yourData, pair==1.1))
lm(V2~g, data=filter(yourData, pair==1.201))
This ensures that you are eliminating rows that don't contain your desired pair
value for each regression model. You could probably create a loop to do this, but I think it's easier to just manually keep filtering through the pair
values. If you really want to use a loop, here's a fairly simple way to do it:
for (i in levels(yourData$pair)) {
if (i==1.1) {
mod1 <- lm(V1~g, data=filter(yourData, pair==i))
}
if (i==1.201) {
mod2 <- lm(V2~g, data=filter(yourData, pair==i))
}
}
But this is still manually looping through the levels of pair
. I would have to see your whole data set to automate the looping process.
Also, if the g
column contains your dependent values, the call should be lm(g~V1)
, lm(g~V2)
, etc. It shouldn't be lm(V1~g)
.
Upvotes: 1