Reputation: 1243
I have a following survival data
library(survival)
data(pbc)
#model to be plotted and analyzed, convert time to years
fit <- survfit(Surv(time/365.25, status) ~ edema, data = pbc)
#visualize overall survival Kaplan-Meier curve
plot(fit)
Here is how the resulting Kaplan-Meier plot looks like
I am further calculating survival at 1, 2, 3 years in this manner:
> summary(fit,times=c(1,2,3))
Call: survfit(formula = Surv(time/365.25, status) ~ edema, data = pbc)
232 observations deleted due to missingness
edema=0
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 126 12 0.913 0.0240 0.867 0.961
2 112 12 0.825 0.0325 0.764 0.891
3 80 26 0.627 0.0420 0.550 0.714
edema=0.5
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 22 7 0.759 0.0795 0.618 0.932
2 17 5 0.586 0.0915 0.432 0.796
3 11 4 0.448 0.0923 0.299 0.671
edema=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 8 11 0.421 0.1133 0.2485 0.713
2 5 3 0.263 0.1010 0.1240 0.558
3 3 2 0.158 0.0837 0.0559 0.446
As you can see, the resulting output shows me 95% confidence intervals between different levels of edema
but no actual p values. Whether confidence intervals overlap or not, I still get a pretty good idea whether survival at these time points are signifiantly different or not, but I would like to have exact p values. How can I do that?
Upvotes: 11
Views: 11320
Reputation: 268
I think the following code does what you are looking for:
library(survival)
data(pbc)
#model to be plotted and analyzed, convert time to years
fit <- survfit(Surv(time/365.25, status) ~ edema, data = pbc)
#visualize overall survival Kaplan-Meier curve
plot(fit)
threeYr <- summary(fit,times=3)
#difference in survival at 3 years between edema=0 and edemo=1 (for example) is
threeYr$surv[1] - threeYr$surv[3]
#the standard error of this is
diffSE <- sqrt(threeYr$std.err[3]^2 + threeYr$std.err[1]^2)
#a 95% CI for the diff is
threeYr$surv[1] - threeYr$surv[3] - 1.96 *diffSE
threeYr$surv[1] - threeYr$surv[3] + 1.96 *diffSE
#a z-test test statistic is
zStat <- (threeYr$surv[1] - threeYr$surv[3])/diffSE
#and a two-sided p-value testing that the diff. is 0 is
2*pnorm(abs(zStat), lower.tail=FALSE)
Alternatively one could make the comparison by estimating the risk ratio or odds ratio based on the estimated probabilities, and perform the inference/test on the log risk ratio or log odds ratio scale. In general I would expect this to perform better (in terms of test size and confidence interval coverage) since the normal approximation will be better on these scales than on the risk difference scale.
Upvotes: 8
Reputation: 4970
Your question is 'are x-year survival rates different for the different categories of edema'.
For example, if you're interested in 3-year survival rates; you only need to focus on that portion of the curve (first 3 years of follow-up), as shown in the figure. The follow-up time for patients that are still alive after 3 years is set to 3 years (i.e., maximum follow-up time in this analysis):pbc$time[pbc$time > 3*365.25] <- 3*365.25
.
Calculating a log-rank test using coxph
in the package 'survival' (same package you are already using in your analysis) for this data set will provide you the P-value that says whether survival at three years is different between the three groups (highly significant in this example). You can also use the same model to generate P-values and hazard ratios for the association of edema with cause-specific survival.
Upvotes: 5