Reputation: 7317
I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this:
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
I manipulate it to look like this:
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)
What should I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today? Should it be the tenure up until today, or should it be NA?
Upvotes: 10
Views: 9695
Reputation: 615
Your dataset consists of 3 observations and only one of them is right censored (2nd observation). As pointed out by @drevicko, it's unclear until which date this 2nd subject was observed. Let's assume this was until 2013-10-01 i.e. for 4 months without an event taking place.
There are 3 options for encoding data that contains only right censoring using survival::Surv().
library(survival)
dat <- data.frame(start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
                  end_date = as.Date(c("2013-08-25", "2013-10-01", "2013-09-12")))
# Approximate tenure in months (days divided by 30.5)
dat$t <- as.numeric(difftime(dat$end_date, dat$start_date, units = "days")) / 30.5
dat$event <- c(1,0,1)
## Option 1: "right"
Surv(time = dat$t, event = dat$event, type = "right")
#> [1] 2.786885 4.000000+ 1.377049
## Option 2: "interval"
Surv(time = dat$t, time2 = c(NA, NA, NA), event = dat$event, type = "interval")
#> [1] 2.786885 4.000000+ 1.377049
## Option 3: "interval2"
dat$t2 <- dat$t
dat$t2[dat$event == 0] <- Inf
Surv(time = dat$t, time2 = dat$t2, type = "interval2")
#> [1] 2.786885 4.000000+ 1.377049
Created on 2024-07-12 with reprex v2.1.0
I find it useful to have some examples of data and the corresponding argument values that encode the data with survival::Surv(). Censored observations have dashed lines to indicate the range in which the true observation could lie.
Upvotes: 0
Reputation: 15180
You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct. The status of 0 for id 2 indicates it's right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
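The calculation described above can be sketched in R as follows. This is only an illustration: the collection date of 2013-10-01 and the names `collection_date`, `last_seen` and `tenure_in_days` are assumptions, not part of the original data. Tenure is kept in days here; converting to months is a separate choice.

```r
# Hypothetical collection date; replace with the date the data was extracted
collection_date <- as.Date("2013-10-01")

subscriptions <- data.frame(
  id         = 1:3,
  start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
  end_date   = as.Date(c("2013-08-25", NA, "2013-09-12"))
)

# status: 1 = cancelled (event observed), 0 = still active (right-censored)
subscriptions$status <- as.integer(!is.na(subscriptions$end_date))

# For active subscriptions, observation stops at the collection date
last_seen <- subscriptions$end_date
last_seen[is.na(last_seen)] <- collection_date

# Observed tenure in days (a lower bound for the censored rows)
subscriptions$tenure_in_days <- as.numeric(last_seen - subscriptions$start_date)
```

The resulting `tenure_in_days` and `status` columns can then be passed to `Surv(time = ..., event = ..., type = "right")` as in the question.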
Upvotes: 0
Reputation: 54340
First, I should say that I disagree with the previous answer. For a subscription that is still active today, the tenure should not be treated as exactly the tenure up until today, nor as NA. What do we know exactly about those subscriptions? We know they have lasted up until today. In other words, although we don't know exactly how long tenure_in_months is for those observations, we know it is longer than the tenure observed up to today.
This situation is known as right censoring in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to:
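id t1 t2
1 2 2
2 3 NA
3 1 1
Here (t, t) encodes an exact cancellation time and (t, NA) encodes a right-censored tenure, so no separate status column is needed. Then you will need to change your R Surv object, from:
Surv(time=tenure_in_months, event=status, type="right")
to (with type="interval2" only the two time arguments are given):
Surv(t1, t2, type="interval2")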
See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of the computational details can be found at: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
From the Surv documentation: Interval censored data can be represented in two ways. For the first, use type = "interval" and the event codes 0 = right censored, 1 = event at time, 2 = left censored, 3 = interval censored. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = "interval2", with NA taking the place of infinity. It has proven to be the more useful.
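As a small sketch of the interval2 representation described above, using the illustrative t1/t2 values from the table (the value 3 for id 2 is an assumption about that subscription's tenure as of today). Only the two time arguments are passed, which is how the documentation describes this type:

```r
library(survival)

# (t, t)  : exact event time
# (t, NA) : true time is longer than t, i.e. right-censored
t1 <- c(2, 3, 1)
t2 <- c(2, NA, 1)

obj <- Surv(t1, t2, type = "interval2")

# Printing shows which observations Surv treats as censored
print(obj)
```

Internally the object stores a status code per row, so the censored subscription stays in the data instead of being dropped as a missing value.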
Upvotes: 10
Reputation: 6020
If a missing end date means that the subscription is still active, then you need to use the time until the current date as the censoring time.
NA won't work with the survival object; I think those cases will be omitted. That is not what you want, because these cases contain important information about survival.
SQL code to get the time until the event or censoring (use in the SELECT part of the query):
DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
BTW: I would use the difference in days for my analysis. It does not make sense to round the time off to months.
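To illustrate the point about NA, here is a rough R sketch comparing the two choices on the three example subscriptions. It assumes a "current date" of 2013-10-01 (so 122 days of tenure for the active subscription) and relies on R's default `na.action` setting (`na.omit`); the variable names are made up for the example.

```r
library(survival)

status <- c(1, 0, 1)  # 1 = cancelled, 0 = still active

# Tenure in days, with NA for the still-active subscription
tenure_na <- c(85, NA, 42)
fit_na <- survfit(Surv(tenure_na, status) ~ 1)
fit_na$n  # number of observations actually used in the fit

# Same data, but censored at the assumed current date instead of NA
tenure_censored <- c(85, 122, 42)
fit_ok <- survfit(Surv(tenure_censored, status) ~ 1)
fit_ok$n

# Compare the estimated survival curves
summary(fit_na)
summary(fit_ok)
```

With the NA, the active subscription is left out, so the curve is based only on the cancelled ones and drops to zero; censoring at the current date keeps it in the risk set.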
Upvotes: 1