Different p-values when using group_by followed by lm() compared to lm() alone

Question

Could I please get some help with the following? I have a data frame containing multiple groups, and I would like to fit a linear model to each group. As a test, I subset just one of the groups, ran lm() on it, and got the following output:

test <- filter(dat, locus == "ChrX_1")
test.result <- lm(methylation ~ Pheno, dat)

         term estimate std.error statistic       p.value
1 (Intercept)   56.955 0.9729203 58.540254 9.080525e-250
2      Pheno1    9.015 1.1915791  7.565591  1.464884e-13

I then used group_by from the dplyr package to run lm() on each group. But the p-value for the locus "ChrX_1" is now different and weaker (larger).

test.result4 <- group_by(dat, locus) %>%
  do(model.test2 = lm(methylation ~ Pheno, data = .))
tidy(test.result4, model.test2)  

    locus        term estimate std.error statistic      p.value
1   ChrX_1 (Intercept)    59.40  4.476666 13.268804 1.342225e-13
2   ChrX_1      Pheno1     9.05  5.482773  1.650624 1.099895e-01
3  ChrX_10 (Intercept)    59.00  4.069398 14.498459 1.522725e-14
4  ChrX_10      Pheno1    11.40  4.983974  2.287331 2.993721e-02
5  ChrX_11 (Intercept)    58.90  4.665565 12.624408 4.460131e-13
6  ChrX_11      Pheno1     9.10  5.714127  1.592544 1.224905e-01
7  ChrX_12 (Intercept)    52.80  3.717022 14.204921 2.526739e-14
8  ChrX_12      Pheno1    10.65  4.552403  2.339424 2.667444e-02
9  ChrX_13 (Intercept)    53.10  3.556734 14.929427 7.343091e-15
10 ChrX_13      Pheno1     7.10  4.356092  1.629901 1.143224e-01
# ... with 30 more rows

So I was wondering what is causing the p-values to weaken. I expected the p-value to be the same as when I subsetted the locus and ran lm() on it.

Thanks

Kumar Manglam · Accepted Answer

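As I mentioned in the comment, the problem is that your first model is not fitted to the filtered data: lm(methylation ~ Pheno, dat) uses the entire dataset dat rather than the subset test. It therefore pools all loci, which is why its p-value does not match the per-locus result from group_by.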

Below is code, with sample data, showing that the results match once lm() is run on the filtered data and compared with the group_by output.

library(dplyr)
library(broom)

# Simulated sample data: 1000 rows spread over 10 loci
set.seed(123)
dat <- data.frame(methylation=runif(1000, min=10, max=200), 
  Pheno=runif(1000, min=10, max=200), 
  locus=sample(paste0("ChrX_", 1:10), 1000, replace=TRUE)
  )
dat$locus <- as.character(dat$locus)

# Fit the model on the filtered data (test), not on the full data (dat)
test <- filter(dat, locus == "ChrX_1")
test.result <- lm(methylation ~ Pheno, test)
summary(test.result)

# Fit one model per locus; the ChrX_1 rows should match the summary above
test.result4 <- group_by(dat, locus) %>%
  do(model.test2 = lm(methylation ~ Pheno, data = .))
tidy(test.result4, model.test2)
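
To check the match programmatically rather than by eye, here is a minimal sketch using the same simulated data. The names fit_pooled, fit_single, per_locus, grouped_p and single_p are made up for illustration. It also swaps the two-argument tidy() call for do(tidy(lm(...))), on the assumption that the two-argument form may not be available in every broom version.

```r
library(dplyr)
library(broom)

# Same simulated data as above (random values, not real methylation data).
set.seed(123)
dat <- data.frame(
  methylation = runif(1000, min = 10, max = 200),
  Pheno = runif(1000, min = 10, max = 200),
  locus = sample(paste0("ChrX_", 1:10), 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

test <- filter(dat, locus == "ChrX_1")

# The model from the question: `dat` is passed, so every locus is pooled.
fit_pooled <- lm(methylation ~ Pheno, data = dat)
# The intended model: only the rows for ChrX_1.
fit_single <- lm(methylation ~ Pheno, data = test)

# The pooled fit uses all 1000 rows; the single-locus fit uses only nrow(test).
nobs(fit_pooled)
nobs(fit_single)

# Per-locus fits, tidied inside do() so each group yields a plain data frame.
per_locus <- dat %>%
  group_by(locus) %>%
  do(tidy(lm(methylation ~ Pheno, data = .))) %>%
  ungroup()

# Slope p-value for ChrX_1 from the grouped fit and from the manual subset.
grouped_p <- per_locus %>%
  filter(locus == "ChrX_1", term == "Pheno") %>%
  pull(p.value)
single_p <- tidy(fit_single) %>%
  filter(term == "Pheno") %>%
  pull(p.value)

# TRUE when the grouped fit and the manual subset agree.
isTRUE(all.equal(grouped_p, single_p))
```

The nobs() comparison shows why the p-value in the question looked stronger: the pooled model was fitted to many more rows than any single locus has, so its standard errors are much smaller.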
