Maverick

Reputation: 711

Explanation of the formula object used in the coxph function in R

I am a complete novice when it comes to survival analysis. I am working on a project that requires I use the coxph function in the "survival" package, but I am running into trouble because I do not understand what is required by the formula object.

Most descriptions I can find about the function are as follows:

"a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the Surv function. "

I know what needs to be on the left of the operator; the issue is what the function expects on the right-hand side.

Here is a link showing what my data looks like (the actual data set is much larger; I'm only displaying the first 20 data points for brevity):

http://imageshack.com/scaled/large/560/7n80.png

Short explanation of data:

-Row 1 is the header

-Each row after that is a separate patient

-The first column is the age of the patient at the time of the study

-columns 2 through 14 (headed by x2-x13), and 19 (x18) and 20 (x19) are covariates such as race, relationship status, and medical conditions, which take on either true (1) or false (0) values.

-columns 15 (x14) through 18 (x17) are covariates such as tumor size, which take on whole number values greater than 0.

-The second-to-last column "sur" is the number of months survived, and "index" is whether or not that is a right-censored time (1 for true, 0 for false).

Given this data I need to plot a Cox proportional hazards curve, but I end up with an incorrect plot because the right-hand side of the formula object is wrong.

Here is my code, "temp4" is the name I gave to the data table:

library("survival")
temp4 <- read.table("~/data.txt", header=TRUE)
seerCox <- coxph(Surv(sur, index)~ temp4$x1 + temp4$x2 + temp4$x3 + temp4$x4 + temp4$x5 + temp4$x6 + temp4$x7 + temp4$x8 + temp4$x9 + temp4$x10 + temp4$x11 + temp4$x12 + temp4$x13 + temp4$x14 + temp4$x15 + temp4$x16 + temp4$x17 + temp4$x18 + temp4$x19, data=temp4, singular.ok=TRUE)
plot(survfit(seerCox), main= "Cox Estimate", mark.time=FALSE, ylab="Probability", xlab="Survival Time in Months", col=c("blue", "red", "green"))

I should also note that I have tried replacing the right-hand side shown above with the number 1, with a period, and with nothing at all. These attempts all produce a Kaplan-Meier curve.

The following is the console output:

http://imageshack.com/scaled/large/703/px7.png

Each new line is an example of the error produced depending on how I filter the data (i.e., if I only include patients with ages greater than 85, etc.).

If someone could explain how it works, it would be greatly appreciated.

PS- I have searched for a solution for over a week, and I am asking for help here as a last resort.

Upvotes: 1

Views: 1691

Answers (1)

IRTFM

Reputation: 263411

You should not use the prefix temp4$ in the formula if you are also supplying a data argument. The whole purpose of the data argument is to let you drop those prefixes from the formula.

seerCox <- coxph( Surv(sur, index) ~ . , data=temp4, singular.ok=TRUE)

The above would use all of the x-variables in your temp4 data.frame (the period stands for every column not used in the response). This will use just the first 3:

seerCox <- coxph( Surv(sur, index) ~ x1+x2+x3 , data=temp4)
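To make the two forms concrete, here is a minimal sketch on the lung data set that ships with the survival package. It is not your data; the column subset and the object names lung_small, fit_named and fit_dot are only illustrative assumptions.

```r
library(survival)

# Illustration on the `lung` data set from the survival package (not the
# asker's data). In `lung`, status is coded 1 = censored, 2 = dead.
fit_named <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung)

# Keep only the response columns and the three predictors, then let `.`
# stand for "all remaining columns".
lung_small <- lung[, c("time", "status", "age", "sex", "ph.ecog")]
fit_dot <- coxph(Surv(time, status) ~ ., data = lung_small)

# Both formulas describe the same model, so the coefficients should agree.
print(all.equal(coef(fit_named), coef(fit_dot)))
```

One thing that may be worth checking in your own call: with a 0/1 status column, Surv treats 1 as the event and 0 as censored. If index really is 1 for a censored time, as the question describes, the indicator would probably need flipping, for example Surv(sur, index == 0).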

Exactly what the warnings signify depends on the data, as you have in one sense already demonstrated by producing different sorts of collinearity with different subsets. If you have collinear columns, then you get singularities in the inversion of the model matrix, and the software will attempt to drop the aliased columns with a warning. This is really telling you that you do not have enough data to build the large models you are attempting. Exploring that possibility with table calls is often informative.

Bottom line: This is not so much a problem with your formula construction as a problem of not understanding the limitations of the chosen method with the dataset you have assembled. You need to be more careful about defining your goals. What is the highest priority in this research? Do you really need every variable? Is it possible to aggregate some of these anonymous variables into clinically meaningful categories such as diagnostic categories or comorbidities?
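As a rough sketch of what such a check with table might look like, the toy data below is entirely hypothetical (the columns x_a, x_b and x_c are made up, and x_b deliberately duplicates x_a). The rank check with qr is an extra diagnostic added for illustration, not something coxph requires.

```r
# Hypothetical toy data: x_b duplicates x_a, so the two columns are aliased.
toy <- data.frame(
  x_a = c(0, 0, 1, 1, 0, 1, 0, 1),
  x_b = c(0, 0, 1, 1, 0, 1, 0, 1),
  x_c = c(0, 1, 0, 1, 1, 0, 0, 1)
)

# Cross-tabulate pairs of binary covariates. Empty off-diagonal cells mean
# one column completely determines the other.
tab_ab <- table(toy$x_a, toy$x_b)
tab_ac <- table(toy$x_a, toy$x_c)
print(tab_ab)
print(tab_ac)

# A rank below the number of columns confirms a linear dependency.
toy_rank <- qr(as.matrix(toy))$rank
print(toy_rank)
```

On real data the same pairwise tables, run within each subset you filter to, can show which covariates become constant or redundant in that subset.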

Upvotes: 1
