Reputation: 11
My goal is estimate the median survival with upper and lower confidence limits for the median at 90% confidence levels, using a survfit object.
churn_dat <-read_csv("https://raw.githubusercontent.com/square/pysurvival/master/pysurvival/datasets/churn.csv")
churn_dat <- churn_dat %>% filter(months_active > 0)
#create a function of the dataframe by sizes
boot <- function(size,n_sims){
#1. filter data into a particular size
df <- churn_dat %>% filter(company_size == size)
n = nrow(df)
#2. run the bootstrap
experiments = tibble(experiment = rep(1:n_sims, each = n),
index = sample(1:n, size = n * n_sims, replace = TRUE),
time_star = df$months_active[index],
event_star = df$churned[index])
return(experiments)
}
#create a function for plotting
plot_boot_data <- function(experiments){
fit <- survfit(Surv(time_star, event_star) ~ experiment, data = experiments)
#get the median of surv
med <- surv_median(fit)
med <- data.frame(med = med$median)
ggplot(med , aes(x = med, fill= med)) +
geom_histogram(binwidth = .8)+theme_bw()
}
df_10to50 <- boot("10-50",10)
plot_boot_data(df_10to50)
How can i construct the same thing with confidence levels set to 90 %
Upvotes: 1
Views: 1336
Reputation: 263362
The surv_median
-function in pkg:survminer
is essentially doing what someone doing a screen scrape of the console would do after running the non-exposed survmean
function in pkg:survival
. (Note the need for the triple-colon (':::') extraction operator from package survival.) surv_median
uses the hard coded column names and so is unable to deal with a fit
-object that was constructed with a different value of the conf.int
parameter in the results of a call to survfit
. If you want the output of the survmean
-function from such a call, it's not at all difficult. Using your data:
fit <- survfit(Surv(time_star, event_star) ~ experiment, data = df_10to50, conf.int=0.9)
med <- survival:::survmean(fit,rmean=FALSE)
med # result is a named list
#------------
$matrix
records n.max n.start events rmean se(rmean) median 0.9LCL 0.9UCL
experiment=1 673 673 673 298 7.347565 0.2000873 7 5 12
experiment=2 673 673 673 309 7.152863 0.2028425 6 5 10
experiment=3 673 673 673 298 7.345891 0.2068490 9 5 12
experiment=4 673 673 673 323 7.035011 0.1981676 5 4 7
experiment=5 673 673 673 313 7.044400 0.2074104 6 5 9
experiment=6 673 673 673 317 7.061878 0.2021348 6 4 9
experiment=7 673 673 673 311 7.029602 0.2081835 5 4 9
experiment=8 673 673 673 301 7.345766 0.2032876 9 6 10
experiment=9 673 673 673 318 6.912700 0.2050143 7 5 9
experiment=10 673 673 673 327 6.988065 0.1990601 5 4 7
$end.time
[1] 12 12 12 12 12 12 12 12 12 12
If you wanted the median and bounds at 0.9 confidence level it could be obtained with:
med$matrix[ 1 , 7:9] # using numbers instead of column names.
#----------
median 0.9LCL 0.9UCL
7 5 12
I'm afraid there is not a sufficient description of your goal of the process of getting there for me to make sense of the dplyr/magrittr chain of logic, so I'm unable to fill in the proper places in the boot function or the handling of its output by ggplot2
. I was initially very confused because you were using a function named boot
and I thought you were doing bootstrapped analysis, but there didn't seem to be any mechanism for getting any bootstrapped results, i.e. no randomized selection of rows in an indexable dataset.
If you still wanted to make a purpose-built variant of surv_median you might try modifying this line inside the code:
.table <- .table %>% dplyr::select_(
.dots = c("strata", "median", "`0.95LCL`", "`0.95UCL`"))
I wasn't able to figure out what surv_median
was doing with the "strata" column since it didn't match the output of survmean
, but that probably because it was using summary.survfit
rather than going directly to the function that summary.survfit
calls to do the calculations. So happy hacking.
Upvotes: 1