psboonstra
psboonstra

Reputation: 141

Fitted coxph model varies by ordering of terms in formula with time-varying strata and clusters

I encountered strange behavior in the survival package in R, and before I reach out to the package maintainer directly, I'm hoping folks smarter than me can tell me if it's not a bug but rather something that is more statistically nuanced (hence my posting here).

The context is that I want to fit a Cox model with time-varying coefficients based upon time strata, and I also want robust standard errors to account for clustered observations. Surprisingly to me, the output varies depending upon the ordering of terms in the formula argument, and it seems to selectively ignore interactions, depending on whether they come before or after the cluster argument.

Please see the reprex below, including my comments that point out the things that look strange to me. Thanks

PS A bonus question: do you know why, in the expected_fit below, age comes before strata(tgroup) in the variable name but ph.ecog comes after strata(tgroup)?

library(survival);
library(tidyverse);

# Create a fake cluster in the lung data
lung_expanded <- 
  lung %>%
  mutate(fake_id = sample(1:3, nrow(lung), replace = T))


# Model below is as expected: I get the expected time-stratified hazard 
# ratios and robust standard errors 

expected_fit <- 
  survSplit(Surv(time, status) ~ ., 
            data = lung_expanded, 
            cut = c(200, 300), 
            episode = "tgroup") %>%
  coxph(formula = Surv(tstart, time, status) ~ 
          # putting cluster prior to age:time interaction
          # gives expected results
          cluster(fake_id) + age:strata(tgroup) + ph.ecog:strata(tgroup))
expected_fit
#> Call:
#> coxph(formula = Surv(tstart, time, status) ~ age:strata(tgroup) + 
#>     strata(tgroup):ph.ecog, data = ., cluster = fake_id)
#> 
#>                                     coef exp(coef)  se(coef) robust se      z
#> age:strata(tgroup)tgroup=1      0.007912  1.007943  0.013858  0.010575  0.748
#> age:strata(tgroup)tgroup=2     -0.003444  0.996562  0.021995  0.014347 -0.240
#> age:strata(tgroup)tgroup=3      0.020987  1.021209  0.015680  0.006119  3.430
#> strata(tgroup)tgroup=1:ph.ecog  0.665031  1.944551  0.176550  0.177987  3.736
#> strata(tgroup)tgroup=2:ph.ecog  0.650765  1.917007  0.278808  0.032228 20.193
#> strata(tgroup)tgroup=3:ph.ecog  0.097243  1.102129  0.188521  0.131935  0.737
#>                                       p
#> age:strata(tgroup)tgroup=1     0.454370
#> age:strata(tgroup)tgroup=2     0.810300
#> age:strata(tgroup)tgroup=3     0.000604
#> strata(tgroup)tgroup=1:ph.ecog 0.000187
#> strata(tgroup)tgroup=2:ph.ecog  < 2e-16
#> strata(tgroup)tgroup=3:ph.ecog 0.461090
#> 
#> Likelihood ratio test=24.9  on 6 df, p=0.0003561
#> n= 462, number of events= 164 
#>    (1 observation deleted due to missingness)

# Model output below differs from above, but it varies only in the ordering of the
# formula. I only get the hazard ratios  corresponding to  age but not 
# those corresponding ecog status. moreover, I get an unexpected hazard 
# ratio corresponding to a newly created variable called 'cluster(fake_id)'

suspicious_fit <- 
  survSplit(Surv(time, status) ~ ., 
            data = lung_expanded, 
            cut = c(200, 300), 
            episode = "tgroup") %>%
  coxph(formula = Surv(tstart, time, status) ~ 
          # putting age:time interaction prior to the cluster 
          # gives suspicious results
          age:strata(tgroup) + cluster(fake_id) + ph.ecog:strata(tgroup))
suspicious_fit
#> Call:
#> coxph(formula = Surv(tstart, time, status) ~ cluster(fake_id) + 
#>     age:strata(tgroup), data = ., cluster = fake_id)
#> 
#>                                coef exp(coef) se(coef) robust se      z       p
#> cluster(fake_id)           -0.04020   0.96060  0.09952   0.05094 -0.789 0.43004
#> age:strata(tgroup)tgroup=1  0.01928   1.01947  0.01354   0.01477  1.306 0.19166
#> age:strata(tgroup)tgroup=2  0.01029   1.01034  0.02149   0.01782  0.578 0.56359
#> age:strata(tgroup)tgroup=3  0.02254   1.02279  0.01556   0.00655  3.441 0.00058
#> 
#> Likelihood ratio test=4.62  on 4 df, p=0.3288
#> n= 463, number of events= 165

# In the model below, which has no time-varying strata, 
# everything looks good. 

expected_fit2  <- 
  lung_expanded %>%
  coxph(formula = Surv(time, status) ~ 
          # putting age:time interaction prior to the cluster 
          # gives suspicious results
          age + cluster(fake_id) + ph.ecog)
expected_fit2
#> Call:
#> coxph(formula = Surv(time, status) ~ age + ph.ecog, data = ., 
#>     cluster = fake_id)
#> 
#>             coef exp(coef) se(coef) robust se     z      p
#> age     0.011281  1.011345 0.009319  0.006714  1.68 0.0929
#> ph.ecog 0.443485  1.558128 0.115831  0.028093 15.79 <2e-16
#> 
#> Likelihood ratio test=19.06  on 2 df, p=7.279e-05
#> n= 227, number of events= 164 
#>    (1 observation deleted due to missingness)

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.6     purrr_0.3.4    
#>  [5] readr_1.4.0     tidyr_1.1.3     tibble_3.1.2    ggplot2_3.3.3  
#>  [9] tidyverse_1.3.1 survival_3.2-11
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.1  xfun_0.23         splines_4.1.0     haven_2.4.1      
#>  [5] lattice_0.20-44   colorspace_2.0-1  vctrs_0.3.8       generics_0.1.0   
#>  [9] htmltools_0.5.1.1 yaml_2.2.1        utf8_1.2.1        rlang_0.4.11     
#> [13] pillar_1.6.1      glue_1.4.2        withr_2.4.2       DBI_1.1.1        
#> [17] dbplyr_2.1.1      readxl_1.3.1      modelr_0.1.8      lifecycle_1.0.0  
#> [21] cellranger_1.1.0  munsell_0.5.0     gtable_0.3.0      rvest_1.0.0      
#> [25] evaluate_0.14     knitr_1.33        fansi_0.5.0       highr_0.9        
#> [29] Rcpp_1.0.7        broom_0.7.6       backports_1.2.1   scales_1.1.1     
#> [33] jsonlite_1.7.2    fs_1.5.0          hms_1.1.0         digest_0.6.27    
#> [37] stringi_1.6.2     grid_4.1.0        cli_2.5.0         tools_4.1.0      
#> [41] magrittr_2.0.1    crayon_1.4.1      pkgconfig_2.0.3   ellipsis_0.3.2   
#> [45] Matrix_1.3-4      xml2_1.3.2        reprex_2.0.0      lubridate_1.7.10 
#> [49] assertthat_0.2.1  rmarkdown_2.8     httr_1.4.2        rstudioapi_0.13  
#> [53] R6_2.5.0          compiler_4.1.0
Created on 2021-08-16 by the reprex package (v2.0.0)
```

Upvotes: 1

Views: 263

Answers (1)

psboonstra
psboonstra

Reputation: 141

I drilled down the survival::coxph code, identified the offending line of code, and reached out to Terry Therneau (the package maintainer) who confirmed that this is a bug. He noted there are "...bugs in [.formula in base R, which get inherited by update.formula and reformulate." He has informed me that version 3.2-13 fixes this issue.

Upvotes: 3

Related Questions