Reputation: 514
I have a stratified cox-model and want predicted survival-curves for certain profiles, based on that model.
Now, because I'm working with a large dataset with a lot of strata, I want predictions for very specific strata only, to save time and memory.
The help-page of survfit.coxph
states: ... If newdata does contain strata variables, then the result will contain one curve per row of newdata, based on the indicated stratum of the original model.
When I run the code below, where newdata does contain the stratum-variable, I still get predictions for both strata, which contradicts the help-page
df <- data.frame(X1 = runif(200),
X2 = sample(c("A", "B"), 200, replace = TRUE),
Ev = sample(c(0,1), 200, replace = TRUE),
Time = rexp(200))
testfit <- coxph( Surv(Time, Ev) ~ X1 + strata(X2), df)
out <- survfit(testfit, newdata = data.frame(X1 = 0.6, X2 = "A"))
Is there anything I fail to see or understand here?
Upvotes: 3
Views: 1940
Reputation: 514
Some comments that may be helpfull:
My example was not big enough (and I seem not to have read the related github post very well, but that was after I posted my question here): if newdata
has at least two lines (and of course the strata-variable), predictions are returned only for the requested strata
There is an inefficiency inside survfit.coxph
, where the baseline-hazard is calculated for every stratum in the original dataset, not only for the requested strata (see my contribution to the same github post). However, that doesn't seem to be a big issue (a test on a dataset with roughly half a million observation, 50% events and 1000 strata), takes less than a minute
The problem is memory allocation somewhere during calculations (in the above example, things collapse once I want predictions for 100 observations - 1 stratum each - while the final output of predictions for 80 is only a few MB)
My work-around:
lp <- predict(..., type='lp')
to get the linear predictor for all these observationssurvfit(fit, newdata = expand_grid(newdf, strat = strata_list))
survfit.coxph
only once and if you focus on survival on only a few times (e.g. 5- and 10-year survival), you can reduce the computer time even moreUpvotes: 1
Reputation: 2115
I'm not sure if this is a bug or a feature in survival:::survfit.coxph
. It looks like the intended behaviour in the code is that only requested strata are returned. In the function:
strata(X2)
is evaluated in an environment containing newdata
and the result, A
is returned.split
the curve into strata, but only if result$surv
is a matrix.In your example it is not a matrix. I can't find any documentation on the expected usage of this if it's not a bug. Perhaps it would be worth dropping the author/maintainer a note.
maintainer("survival")
# [1] "Terry M Therneau <[email protected]>"
Upvotes: 1