Bathsraikou

Reputation: 13

What is actually occurring in this multiple linear regression analysis done in R?

I am working on a project with data analysis in R.

What I am trying to do is determine whether a dataset can be described with a linear regression model, and then test whether certain subgroups of that dataset show a stronger correlation than the dataset as a whole. More specifically, I am working with a dataset where students recorded their pulse and time estimations, and checking whether the correlation is stronger in a subgroup of students who were not found to have a daily rhythm in either variable vs. a subgroup who were calculated to have a daily rhythm in both time estimation and heart rate. The values I am using are each student's daily averages for both time estimation and heart rate.

I ran a linear model of the whole dataset:

> summary(ptmod1)
Call:
lm(formula = avg.time ~ avg.pulse, data = pulsetime)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.7310  -1.6725  -0.0162   2.0134   9.8548 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 68.82047    2.99244  22.998   <2e-16 ***
avg.pulse   -0.10449    0.04115  -2.539   0.0125 *

and also attempted to run a linear regression of each subgroup

> summary(ptmod2)

Call:
lm(formula = avg.time ~ avg.pulse + Group, data = pulsetime)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.9884  -1.7723  -0.1873   2.4900   8.7424 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 68.45350    2.92287  23.420  < 2e-16 ***
avg.pulse   -0.08566    0.03985  -2.149  0.03388 *  
GroupOne    -1.22325    0.91444  -1.338  0.18386    
GroupThree   0.11062    0.97666   0.113  0.91003    
GroupTwo    -3.09096    0.95446  -3.238  0.00161 **

However, I wanted to make sure that what I was seeing was correct, because I did not really expect so many of the groups to have significant coefficients. So I cut the groups up into their own .csv files and generated linear models for each of them individually. Cutting them up into their own files also made it easier to run a Chow test as a post-hoc analysis. When I ran regressions on them again, I got quite different coefficients. For example, here is the summary for Group One:

> summary(mod1)

Call:
lm(formula = avg.time ~ avg.pulse, data = group1)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.1048 -1.6529 -0.7279  1.4063  5.6574 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 67.41445    4.15917  16.209 8.99e-15 ***
avg.pulse   -0.08916    0.05657  -1.576    0.128  

This makes me question what exactly the results from the summary of ptmod2 actually mean. I was uncertain how to set up the R code for a linear model sorted by individual subgroups, so my code for it was

> ptmod2<-lm(avg.time~avg.pulse+Group, data=pulsetime)

In my spreadsheet file, I have three columns: avg.pulse, avg.time, and Group. "Group" is a column of the words "One", "Two", "Three", and "Four" assigned based on subgroup.

Did the summary for ptmod2 fit a linear regression across the whole dataset? I am really not sure what happened.

Thank you so much for any insight you can provide. Perhaps my code for comparing regressions by group was incorrect.

Upvotes: 1

Views: 295

Answers (1)

Oliver

Reputation: 8572

This is somewhat of a split between a programming and a statistics question, and might be better suited for Cross Validated. However, it is simple enough to work through here.

Your question can be split into the following sub-questions:

  1. Am I fitting a model on the full (whole) dataset in ptmod2?
  2. How do I estimate multiple models across grouped datasets?
  3. What is the correct way to analyse the coefficients of such a situation?

Am I fitting a model on the full (whole) dataset in ptmod2?

The long and short of it is "yes". In R (and statistics generally), adding a "group" variable to your model is not equivalent to splitting your dataset into multiple groups. Instead it adds indicator variables (0 or 1) for the groups, with one level serving as the reference. So in your case you have 4 groups, and the model contains indicators for whether someone is in group One, Two or Three, with group Four as the reference level. Each indicator measures how much the intercept differs between that group and the reference. E.g. these variables have the interpretation:

If the models share a common slope for avg.pulse, is there a significant difference in avg.time explained by the specific group?

The reason you see only 3 groups and not 4 is that the fourth group is captured by setting all the other indicators to 0. E.g. if you are not in group One, Two or Three, you must be in group Four. So the "effect" of being in group Four is absorbed into the intercept, and the other group coefficients are measured relative to it.
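You can check which level R has chosen as the reference (and change it) by inspecting and releveling the factor. A minimal sketch using a toy character vector, since the actual pulsetime data isn't shown here; note that R orders factor levels alphabetically by default, which is why "Four" ends up as the reference in your model:

```r
# R sorts character levels alphabetically, so "Four" comes first
g <- factor(c("One", "Two", "Three", "Four"))
levels(g)   # "Four" "One" "Three" "Two" -- first level is the reference

# relevel() picks a different reference level, e.g. "One"
g2 <- relevel(g, ref = "One")
levels(g2)  # "One" "Four" "Three" "Two"
```

Refitting the model after a `relevel()` changes which group the remaining coefficients are compared against, but not the fit itself.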

A method for studying this, which many of my students have found helpful, is to look at a small version of the model.matrix, for example:

data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)

where you can see directly that there is a column for the (Intercept), one for hp, and 2 columns for cyl6 and cyl8 (no column for cyl4, which is the reference). Matching the rows of cyl6 and cyl8 to the values in mtcars illustrates that a 1 in the cyl6 column indicates cyl == 6.

How do I estimate multiple models across grouped datasets?

There are multiple methods for doing this, depending on the question being asked. In your case you seem interested in the question "Is there a significant difference in the effect of avg.pulse between groups?". I.e., you want to estimate the avg.pulse coefficient for each group. One option is to do as you did later, and estimate a model within each group:

groups <- split(pulsetime, pulsetime$Group)
models <- lapply(groups, function(df) lm(avg.time ~ avg.pulse, data = df))
lapply(models, summary)

which gives the per-group estimates. The problem here is how to compare them. There are formal methods for doing so, comparing the covariance between the parameters of the separate models (this falls under multivariate analysis), but that is overly complicated here, as the models share a common outcome.
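If you do go the split-and-fit route, the per-group slopes can at least be collected into one vector for a side-by-side look. A sketch using mtcars as a stand-in, since the pulsetime data isn't available:

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)

# One model per group, as in the answer above
groups <- split(mtcars, mtcars$cyl)
models <- lapply(groups, function(df) lm(mpg ~ hp, data = df))

# Collect the slope of hp from each model into a named vector
sapply(models, function(m) coef(m)[["hp"]])
```

This makes eyeballing the slopes easy, but it still gives no formal test of whether they differ, which is why the interaction approach below is preferable.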

A much simpler method is to incorporate the group-specific estimates by adding an "extra" effect for each group using indicator variables. This works similarly to adding the Group variable on its own (which shifts the intercept), but instead we interact it with the variable in question, using one of:

# Let each group have their own `avg.pulse` variable
ptmod2_1 <- lm(formula = avg.time ~ avg.pulse : Group, data = pulsetime)
# Let each group have their own `avg.pulse` variable and account for the effect of `Group`
ptmod2_2 <- lm(formula = avg.time ~ avg.pulse * Group, data = pulsetime)

In the former you'll see avg.pulse:GroupX for all 4 groups, meaning these are the "effect of avg.pulse in group X", while in the latter you'll once again have a reference level. Note a stark difference between the 2 methods: in the former all groups share the same intercept, while in the latter each group can have a different intercept.
In general the latter is the preferred method, unless you have a very good reason not to expect each group to have a different average. It is very similar to the rule of thumb: "Don't test your intercept, unless you have a very good reason, and even then you probably shouldn't." Basically because it makes a lot of logical sense to follow these rules (though it can take a few days of reflection to realize why).

What is the correct way to analyse the coefficients of such a situation?

If you stick with one of the 2 latter methods, the analysis is similar to a normal regression analysis. Coefficients can be tested using t-tests, ANOVA and so on (via summary/drop1 and anova), and if you have a reason you can test merging groups using standard tests as well (although if the group terms are insignificant there is rarely a reason to merge them anyway). The whole trick becomes "how do I interpret the coefficients".
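For example, the overall question "do the slopes differ between groups at all?" can be answered with an F-test comparing the nested models (common slope vs. group-specific slopes). A sketch with mtcars as a stand-in:

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)

fit_add <- lm(mpg ~ hp + cyl, data = mtcars)  # common slope, different intercepts
fit_int <- lm(mpg ~ hp * cyl, data = mtcars)  # group-specific slopes and intercepts

# F-test: do the group-specific slopes significantly improve the fit?
anova(fit_add, fit_int)

# drop1 tests each droppable term; with the interaction present,
# marginality means only hp:cyl is tested
drop1(fit_int, test = "F")
```

A significant `anova()` result here is evidence that at least one group's slope differs, which is essentially the question the Chow test addresses as well.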

For method 1 it is glaringly obvious: "group 1 has an avg.pulse effect of so much", and so on. For method 2 it is slightly more subtle. The effect of avg.pulse in group 1 is avg.pulse + avg.pulse:GroupOne. Note that avg.pulse does **not** disappear when you change group: it is the slope for the reference level, and every other coefficient is the **additional** effect on avg.pulse of being in group X rather than the reference. Visually, your slope in the graph becomes steeper (flatter) if the coefficient is positive (negative).
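To make this concrete, the group-specific slopes under method 2 can be reconstructed by hand from the fitted coefficients (again using mtcars as a stand-in for your data):

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
fit_12 <- lm(mpg ~ hp * cyl, data = mtcars)
cf <- coef(fit_12)

# Slope for the reference group (cyl == 4) is just the hp coefficient;
# for the other groups, add the corresponding interaction term
c(cyl4 = cf[["hp"]],
  cyl6 = cf[["hp"]] + cf[["hp:cyl6"]],
  cyl8 = cf[["hp"]] + cf[["hp:cyl8"]])
```

These sums match the slopes you would get from fitting each group separately; what method 2 adds is a shared error variance and direct tests of the differences.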

I've given a visualization below using the mtcars dataset, with mpg as the outcome, hp as the numeric variable and cyl (as a factor) as the grouping variable. Confidence intervals are removed as they are not important for the illustration. The important part is to note how different the 2 models are (the slope for cyl == 4 is positive in one, negative in the other!). This further supports the idea that method 2 is often "more correct" than the former.

[Interaction plot]

Code for reproducibility

Below is the code I've used for my illustrations and examples

data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model.matrix(mpg ~ hp + cyl, data = mtcars)

#split-fit across groups and run summary
groups <- split(mtcars, mtcars$cyl)
models <- lapply(groups, function(df) lm(mpg ~ hp, data = df))
lapply(models, summary)

#Fit using interaction effects
fit_11 <- lm(mpg ~ hp:cyl , mtcars)
fit_12 <- lm(mpg ~ hp*cyl , mtcars)
summary(fit_11)
summary(fit_12)

#Illustrate interaction effects
library(sjPlot)
library(sjmisc)
library(ggplot2)
library(patchwork)
theme_set(theme_sjplot())
p1 <- plot_model(fit_11, type = "pred", terms = c("hp","cyl"), ci.lvl = 0) + ggtitle("Same intercept different slope") + 
  geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p2 <- plot_model(fit_12, type = "pred", terms = c("hp", "cyl"), ci.lvl = 0) + ggtitle("Different intercept and slope") + 
  geom_point(aes(x = hp, y = mpg, col = cyl, fill = cyl), mtcars, inherit.aes = FALSE)
p1 / p2 

Upvotes: 2
