Dries
Dries

Reputation: 514

survfit for stratified cox-model

I have a stratified cox-model and want predicted survival-curves for certain profiles, based on that model.

Now, because I'm working with a large dataset with a lot of strata, I want predictions for very specific strata only, to save time and memory.

The help-page of survfit.coxph states: ... If newdata does contain strata variables, then the result will contain one curve per row of newdata, based on the indicated stratum of the original model.

When I run the code below, where newdata does contain the stratum-variable, I still get predictions for both strata, which contradicts the help-page

df <- data.frame(X1 = runif(200),
             X2 = sample(c("A", "B"), 200, replace = TRUE),
             Ev = sample(c(0,1), 200, replace = TRUE),
             Time = rexp(200))

testfit <- coxph( Surv(Time, Ev) ~ X1 + strata(X2), df)

out <- survfit(testfit, newdata = data.frame(X1 = 0.6, X2 = "A"))

Is there anything I fail to see or understand here?

Upvotes: 3

Views: 1940

Answers (2)

Dries
Dries

Reputation: 514

Some comments that may be helpfull:

  • My example was not big enough (and I seem not to have read the related github post very well, but that was after I posted my question here): if newdata has at least two lines (and of course the strata-variable), predictions are returned only for the requested strata

  • There is an inefficiency inside survfit.coxph, where the baseline-hazard is calculated for every stratum in the original dataset, not only for the requested strata (see my contribution to the same github post). However, that doesn't seem to be a big issue (a test on a dataset with roughly half a million observation, 50% events and 1000 strata), takes less than a minute

  • The problem is memory allocation somewhere during calculations (in the above example, things collapse once I want predictions for 100 observations - 1 stratum each - while the final output of predictions for 80 is only a few MB)

  • My work-around:

    • Select all observations you want predictions for
    • use lp <- predict(..., type='lp') to get the linear predictor for all these observations
    • use survfit only on the first observation: survfit(fit, newdata = expand_grid(newdf, strat = strata_list))
    • Store the resulting survival estimates in a data.frame (or not, that's up to you)
    • To calculate predicted survival for other observations, use the PH-assumption (see formula below). This invokes the overhead of survfit.coxph only once and if you focus on survival on only a few times (e.g. 5- and 10-year survival), you can reduce the computer time even more

enter image description here

Upvotes: 1

CSJCampbell
CSJCampbell

Reputation: 2115

I'm not sure if this is a bug or a feature in survival:::survfit.coxph. It looks like the intended behaviour in the code is that only requested strata are returned. In the function:

  • strata(X2) is evaluated in an environment containing newdata and the result, A is returned.
  • The full curve is then created.
  • There is then some logic to split the curve into strata, but only if result$surv is a matrix.

In your example it is not a matrix. I can't find any documentation on the expected usage of this if it's not a bug. Perhaps it would be worth dropping the author/maintainer a note.

maintainer("survival")
# [1] "Terry M Therneau <[email protected]>"

Upvotes: 1

Related Questions