svannoy
svannoy

Reputation: 2005

How to resolve discrepancy in 95% confidence intervals in survival function in R

I'm in the process of writing some functions to extract information from the results of a survival analysis and I ran into a discrepancy between my extraction of the lower and upper survival time as specified by the 95% confidence interval and that which is reported from the package itself as a summary.

I'm using the survival package (v 2.37-7) in R (v 3.1.2).

So my problem is that sometimes my extraction of the lower and/or upper boundary of the 95% CI for the median survival time does not match what is returned when I just evaluate the results of survfit. When I inspect the data, I believe the results of survfit are wrong, it appears to be returning the boundary+1 value (again, sometimes). Here are some data that illustrate the problem.

# Fit my data stratified by gender of subject
survFit30Sex <- survfit(Surv(thirtyDaySuicides$daysFromInvestigation) ~ thirtyDaySuicides$Sex)

# Display median survival and confidence interval
survFit30Sex


Call: survfit(formula = Surv(thirtyDaySuicides$daysFromInvestigation) ~ 
thirtyDaySuicides$Sex)

                    records n.max n.start events median 0.95LCL 0.95UCL
thirtyDaySuicides$Sex=1      35    35      35     35     15       9      20
thirtyDaySuicides$Sex=2      93    93      93     93      9       6      13

survfit determines the lower and upper boundary for Sex = 1 as 9 days and 20 days respectively but when I inspect the data, it seems that the upper boundary should be 19, not 20

Here is the actual data; I'm just showing for Sex=1 as that is where the discrepancy is, I've also cut out values well before and after the critical region to make the data easier to read

Call: survfit(formula = Surv(thirtyDaySuicides$daysFromInvestigation) ~ 
    thirtyDaySuicides$Sex)

summary( thirtyDaySuicides$Sex=1 )
     time n.risk n.event survival std.err lower 95% CI upper 95% CI
    9     24       2   0.6286  0.0817      0.48725        0.811
   10     22       1   0.6000  0.0828      0.45780        0.786
   11     21       1   0.5714  0.0836      0.42890        0.761
   13     20       1   0.5429  0.0842      0.40055        0.736
   14     19       1   0.5143  0.0845      0.37272        0.710
   15     18       1   0.4857  0.0845      0.34541        0.683
   16     17       1   0.4571  0.0842      0.31861        0.656
   17     16       3   0.3714  0.0817      0.24138        0.572
   19     13       1   0.3429  0.0802      0.21673        0.542
   20     12       2   0.2857  0.0764      0.16921        0.482
   21     10       2   0.2286  0.0710      0.12437        0.420
   22      8       1   0.2000  0.0676      0.10310        0.388

As I understand it, the lower 95% CI for the median survival time is 0.34541. Searching down the survival column until finding a value < 0.34541 occurs in the row associated with a survival time of 19 (survival = 0.3429). Isn't this the upper bound? Why does survfit return an upper survival time of 20? I've automated this algorithm and most of the time I match the output from survfit but not always.

This leads me to think that either there is some strange error in the survival package (which I doubt), or I'm finding the boundary incorrectly (most likely).

--------- Update

Unfortunately I don't know how to link a data file to my question, but the data is pretty short so I can put it here. Note that I eliminated the stratification by Sex to simplify, thus this is just the data for females which is where I get the discrepancy.

It occurs to me that I was approaching this incorrectly, perhaps the 95% CI is being computed from the standard error, not looked up the way I'm thinking of it. But even with that idea I'm having similar problems. The question is more generally, how does one pull out the Xth percentile survival time with it's corresponding 95% CI in units of time from a survfit object?

Here is the survival input data via dput, and then an unstructured copy below that.

structure(list(daysFromInvestigation = c(27L, 27L, 10L, 20L, 
15L, 21L, 27L, 1L, 9L, 22L, 29L, 14L, 4L, 19L, 7L, 3L, 2L, 7L, 
21L, 4L, 17L, 20L, 16L, 2L, 9L, 7L, 17L, 2L, 17L, 26L, 25L, 11L, 
3L, 13L, 27L), censored = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1)), class = "data.frame", row.names = c(NA, -35L), .Names = c("daysFromInvestigation", 
"censored"))

       daysFromInvestigation censored
1                     27        1
2                     27        1
3                     10        1
4                     20        1
5                     15        1
6                     21        1
7                     27        1
8                      1        1
9                      9        1
10                    22        1
11                    29        1
12                    14        1
13                     4        1
14                    19        1
15                     7        1
16                     3        1
17                     2        1
18                     7        1
19                    21        1
20                     4        1
21                    17        1
22                    20        1
23                    16        1
24                     2        1
25                     9        1
26                     7        1
27                    17        1
28                     2        1
29                    17        1
30                    26        1
31                    25        1
32                    11        1
33                     3        1
34                    13        1
35                    27        1

Upvotes: 3

Views: 1220

Answers (1)

svannoy
svannoy

Reputation: 2005

I have an answer to my own question, at least a good approximate answer if not the best answer.

The main problem that I was having was failing to use a weighted average. In my question, I was interested in median survival time, so survival = 0.5. But my data didn't produce events at the precise median time and thus I have a survival probability of 14 days = 0.5143 and for 15 days = 0.4857, the weighted average of which rounds to 15 days.

The second problem was misunderstanding how to use the confidence intervals. In order to match what the survival package reports, to find the lower bound of the median survival interval, one searches the lower bound vector to find the first value just less than the median, and then computes a weighted average of time for the value just below the median and just above. Likewise, for the upper bound one searches the upper bound vector to find the target interval and then computes a weighted average. For my example, the upper bound of the median survival happens between 19 days and 20 days. The weighted average rounds to 20 days.

I haven't tracked deeply into the survival code to confirm this is how it is done properly, but in my case I've got about 50 specific combinations of survival fits looking at different time periods and different moderators and I am matching the median output provided by the survival package 100%.

I hope anyone who runs into this same question is helped by this summary, and if anyone wants to help correct/refine my understanding, it's most welcome.

Upvotes: 0

Related Questions