Reputation: 141
I encountered strange behavior in the survival
package in R, and before I reach out to the package maintainer directly, I'm hoping folks smarter than me can tell me if it's not a bug but rather something that is more statistically nuanced (hence my posting here).
The context is that I want to fit a Cox model with time-varying coefficients based upon time strata, and I also want robust standard errors to account for clustered observations. Surprisingly to me, the output varies depending upon the ordering of terms in the formula
argument, and it seems to selectively ignore interactions, depending on whether they come before or after the cluster argument.
Please see the reprex below, including my comments that point out the things that look strange to me. Thanks
PS A bonus question: do you know why, in the expected_fit
below, age
comes before strata(tgroup)
in the variable name but ph.ecog
comes after strata(tgroup)
?
library(survival);
library(tidyverse);
# Create a fake cluster in the lung data
lung_expanded <-
lung %>%
mutate(fake_id = sample(1:3, nrow(lung), replace = T))
# Model below is as expected: I get the expected time-stratified hazard
# ratios and robust standard errors
expected_fit <-
survSplit(Surv(time, status) ~ .,
data = lung_expanded,
cut = c(200, 300),
episode = "tgroup") %>%
coxph(formula = Surv(tstart, time, status) ~
# putting cluster prior to age:time interaction
# gives expected results
cluster(fake_id) + age:strata(tgroup) + ph.ecog:strata(tgroup))
expected_fit
#> Call:
#> coxph(formula = Surv(tstart, time, status) ~ age:strata(tgroup) +
#> strata(tgroup):ph.ecog, data = ., cluster = fake_id)
#>
#> coef exp(coef) se(coef) robust se z
#> age:strata(tgroup)tgroup=1 0.007912 1.007943 0.013858 0.010575 0.748
#> age:strata(tgroup)tgroup=2 -0.003444 0.996562 0.021995 0.014347 -0.240
#> age:strata(tgroup)tgroup=3 0.020987 1.021209 0.015680 0.006119 3.430
#> strata(tgroup)tgroup=1:ph.ecog 0.665031 1.944551 0.176550 0.177987 3.736
#> strata(tgroup)tgroup=2:ph.ecog 0.650765 1.917007 0.278808 0.032228 20.193
#> strata(tgroup)tgroup=3:ph.ecog 0.097243 1.102129 0.188521 0.131935 0.737
#> p
#> age:strata(tgroup)tgroup=1 0.454370
#> age:strata(tgroup)tgroup=2 0.810300
#> age:strata(tgroup)tgroup=3 0.000604
#> strata(tgroup)tgroup=1:ph.ecog 0.000187
#> strata(tgroup)tgroup=2:ph.ecog < 2e-16
#> strata(tgroup)tgroup=3:ph.ecog 0.461090
#>
#> Likelihood ratio test=24.9 on 6 df, p=0.0003561
#> n= 462, number of events= 164
#> (1 observation deleted due to missingness)
# Model output below differs from above, but it varies only in the ordering of the
# formula. I only get the hazard ratios corresponding to age but not
# those corresponding ecog status. moreover, I get an unexpected hazard
# ratio corresponding to a newly created variable called 'cluster(fake_id)'
suspicious_fit <-
survSplit(Surv(time, status) ~ .,
data = lung_expanded,
cut = c(200, 300),
episode = "tgroup") %>%
coxph(formula = Surv(tstart, time, status) ~
# putting age:time interaction prior to the cluster
# gives suspicious results
age:strata(tgroup) + cluster(fake_id) + ph.ecog:strata(tgroup))
suspicious_fit
#> Call:
#> coxph(formula = Surv(tstart, time, status) ~ cluster(fake_id) +
#> age:strata(tgroup), data = ., cluster = fake_id)
#>
#> coef exp(coef) se(coef) robust se z p
#> cluster(fake_id) -0.04020 0.96060 0.09952 0.05094 -0.789 0.43004
#> age:strata(tgroup)tgroup=1 0.01928 1.01947 0.01354 0.01477 1.306 0.19166
#> age:strata(tgroup)tgroup=2 0.01029 1.01034 0.02149 0.01782 0.578 0.56359
#> age:strata(tgroup)tgroup=3 0.02254 1.02279 0.01556 0.00655 3.441 0.00058
#>
#> Likelihood ratio test=4.62 on 4 df, p=0.3288
#> n= 463, number of events= 165
# In the model below, which has no time-varying strata,
# everything looks good.
expected_fit2 <-
lung_expanded %>%
coxph(formula = Surv(time, status) ~
# putting age:time interaction prior to the cluster
# gives suspicious results
age + cluster(fake_id) + ph.ecog)
expected_fit2
#> Call:
#> coxph(formula = Surv(time, status) ~ age + ph.ecog, data = .,
#> cluster = fake_id)
#>
#> coef exp(coef) se(coef) robust se z p
#> age 0.011281 1.011345 0.009319 0.006714 1.68 0.0929
#> ph.ecog 0.443485 1.558128 0.115831 0.028093 15.79 <2e-16
#>
#> Likelihood ratio test=19.06 on 2 df, p=7.279e-05
#> n= 227, number of events= 164
#> (1 observation deleted due to missingness)
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6 purrr_0.3.4
#> [5] readr_1.4.0 tidyr_1.1.3 tibble_3.1.2 ggplot2_3.3.3
#> [9] tidyverse_1.3.1 survival_3.2-11
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.1 xfun_0.23 splines_4.1.0 haven_2.4.1
#> [5] lattice_0.20-44 colorspace_2.0-1 vctrs_0.3.8 generics_0.1.0
#> [9] htmltools_0.5.1.1 yaml_2.2.1 utf8_1.2.1 rlang_0.4.11
#> [13] pillar_1.6.1 glue_1.4.2 withr_2.4.2 DBI_1.1.1
#> [17] dbplyr_2.1.1 readxl_1.3.1 modelr_0.1.8 lifecycle_1.0.0
#> [21] cellranger_1.1.0 munsell_0.5.0 gtable_0.3.0 rvest_1.0.0
#> [25] evaluate_0.14 knitr_1.33 fansi_0.5.0 highr_0.9
#> [29] Rcpp_1.0.7 broom_0.7.6 backports_1.2.1 scales_1.1.1
#> [33] jsonlite_1.7.2 fs_1.5.0 hms_1.1.0 digest_0.6.27
#> [37] stringi_1.6.2 grid_4.1.0 cli_2.5.0 tools_4.1.0
#> [41] magrittr_2.0.1 crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.2
#> [45] Matrix_1.3-4 xml2_1.3.2 reprex_2.0.0 lubridate_1.7.10
#> [49] assertthat_0.2.1 rmarkdown_2.8 httr_1.4.2 rstudioapi_0.13
#> [53] R6_2.5.0 compiler_4.1.0
Created on 2021-08-16 by the reprex package (v2.0.0)
```
Upvotes: 1
Views: 263
Reputation: 141
I drilled down the survival::coxph
code, identified the offending line of code, and reached out to Terry Therneau (the package maintainer) who confirmed that this is a bug. He noted there are "...bugs in [.formula in base R, which get inherited by update.formula and reformulate." He has informed me that version 3.2-13 fixes this issue.
Upvotes: 3