Reputation: 85
I am very new to machine learning and R, so my question may be unclear or need more information. I have tried to explain as much as possible. Please correct me if I have used the wrong terminology. Any help will be greatly appreciated.
Context - I am trying to build a model to predict "when" an event is going to happen.
I have a dataset with the structure below. This is not the actual data; it is dummy data created to explain the scenario. The actual data cannot be shared due to confidentiality.
About the data -
A customer can use the service above $x. This is indicated by the column "ExceedanceMonth" in the table above. A value of 1 means the customer went above $x in the first month of the subscription; a value of 5 means the customer went above $x in the 5th month of the subscription. A value of NULL indicates that the limit $x has not been reached yet. This could be either because
- the subscription ended and the customer didn't overuse, or
- the subscription is yet to end and the customer might overuse in future.
I have researched survival models. I understand that a survival model like Cox can be used to estimate the hazard function and to understand how each variable affects the time to event. I tried to use the predict function with a Cox model, but I could not tell whether any of the values accepted by the "type" parameter can be used to predict the actual time, i.e. I did not understand how to predict an actual value for WHEN the limit will be crossed.
Maybe a survival model isn't the right approach for this scenario, so please advise on the best way to approach this problem.
library(survival)
# define the survival object (time to exceedance, event indicator)
recsurv <- Surv(time = df$ExceedanceMonth, event = df$LimitReached)
# train/test split, only for testing the code
train <- subset(df, SubStartDate >= "20150301" & SubEndDate <= "20180401")
test <- subset(df, SubStartDate > "20180401")
# use bare column names with data = train; with df$... in the formula the
# model is fit on all of df and predict() cannot apply newdata
fit <- coxph(Surv(ExceedanceMonth, LimitReached) ~ SubDurationInMonths + `#subs` + LimitAmount + Monthlyutitlization + AvgMonthlyUtilization, data = train, model = TRUE)
predicted <- predict(fit, newdata = test)
head(predicted)
1 2 3 4 5 6
0.75347328 0.23516619 -0.05535162 -0.03759123 -0.65658488 -0.54233043
Thank you in advance!
Upvotes: 1
Views: 1456
Reputation: 93
Survival models are fine for what you're trying to do. (I'm assuming you've estimated the model correctly from this point on.)
The key is understanding what comes out of the model. For a Cox model, the default quantity returned by predict() is the linear predictor (b1x1 + b2x2 + ...; there is no b0, because the Cox model doesn't estimate an intercept). That alone won't tell you anything about when.
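To make the point about the default output concrete, here is a minimal sketch. It assumes the `lung` dataset that ships with the survival package as a stand-in, since the asker's data is not available; the covariates `age` and `sex` are just illustrative.

```r
library(survival)

# Stand-in data: `lung` from the survival package (not the asker's data).
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Default prediction: the linear predictor (centered by default).
lp <- predict(fit, type = "lp")

# type = "risk" is exp(lp): a hazard ratio relative to the reference profile.
risk <- predict(fit, type = "risk")

# Neither column is on a time scale; both only rank subjects by hazard.
head(cbind(lp = lp, risk = risk))
```

Higher values mean the event tends to happen sooner, but they carry no units of time, which is why the numbers in the question could not be read as months.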
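Specifying type="expected" for predict() gets closer, but be careful about what it returns: the expected number of events for each customer given the covariates and the follow-up time (how long you watch the customer), with follow-up set equal to the customer's actual duration (retrieved from the coxph model object). It is not itself an expected duration. To get a "when", derive a time from the predicted survival curve, for example the median time at which the curve produced by survfit() on the fitted model drops to 0.5.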
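One way to turn a fitted Cox model into a time is sketched below. This is not necessarily what the answer's author had in mind; it assumes the `lung` stand-in data again, and `new_customers` is a hypothetical pair of covariate profiles.

```r
library(survival)

# Stand-in data: `lung` from the survival package (not the asker's data).
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Expected number of events per training row, given covariates and that
# row's own follow-up time. This is a count, not a duration.
expected_events <- predict(fit, type = "expected")

# Hypothetical covariate profiles to predict for.
new_customers <- data.frame(age = c(50, 70), sex = c(2, 1))

# One predicted survival curve per row of new_customers.
curves <- survfit(fit, newdata = new_customers)

# Median time to event per profile: the first time the curve reaches 0.5.
# NA means the curve never drops that far within the observed follow-up.
median_time <- quantile(curves, probs = 0.5)$quantile
median_time
```

The median is on the same scale as the time variable passed to Surv(), so with months as input it would read as "month in which half of similar customers have exceeded the limit".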
The coxed package computes expected durations from a fitted Cox model using a different method, without the need to worry about follow-up time. It's also a little more forgiving when it comes to supplying a newdata argument, particularly if you have a specific covariate profile in mind. See the package vignette here.
See also this thread for more on predict.coxph().
Upvotes: 1