Reputation: 43
**creat a new data frame and add a binary column called surv24**
leukemia.data <- data.frame(wbc = leuk$wbc, ag = leuk$ag, time = leuk$time, surv24 =ifelse(leuk$time>=24, 1,0))
Wbc ag time surv24
1 2300 present 65 1
2 750 present 156 1
3 4300 present 100 1
4 2600 present 134 1
5 6000 present 16 0
6 10500 present 108 1
7 4000 absent 17 0
**logistic regression model**
logistic.model <- glm(surv24 ~ log.wbc + ag, family='binomial', data=leukemia.data)
summary(logistic.model)
fit <- predict(logistic.model, type='response')
leuki <- data.frame (cbind(leukemia.data, fit))
ag_present <- subset(leuki[leuki$ag=='present',])
ag_absent <- subset(leuki[leuki$ag=='absent',])
plot(surv24 ~ wbc, data = leukemia.data, main = "Survival Probablity vs Number of White blood cells", xlab = "Number of White blood cells", ylab = "Surviavla Probablity")
lines(ag_present$wbc, ag_present$fit, col='red')
lines(ag_absent$wbc, ag_absent$fit, col='green')
legend(0.8,85000, legend =c("Simple Linear Regression Model Predictions","Quadratic Regression Model Predictions"), col = c("green","red"), lty = 1:2, cex=0.7)
This the output I'm getting for the plot
This is an example of the graph I want to construct
Note that I want to use both (base R) and (ggplot) to construct this graph
Upvotes: 2
Views: 3685
Reputation: 173793
It's not really possible to give a working example as an answer without a bit more sample data than this. I'll therefore attempt to reverse-engineer your data set:
set.seed(123)
wbc <- rnorm(100, 9)
logodds <- c(8 - wbc[1:50], 10 - wbc[51:100])
probs <- exp(logodds)/(1 + exp(logodds))
surv24 <- rbinom(100, 1, probs)
ag <- rep(c("absent", "present"), each = 50)
wbc <- 50 * round(wbc * 20)
leukemia.data <- data.frame(wbc, ag, time = sample(200), surv24)[sample(100),]
head(leukemia.data)
#> wbc ag time surv24
#> 24 8250 absent 133 1
#> 39 8700 absent 63 0
#> 47 8600 absent 76 0
#> 60 9200 present 23 0
#> 63 8650 present 39 1
#> 37 9550 absent 104 0
This looks fairly close. Now we will create your model. It is not clear why you are taking the log of the white cell count: this distorts your final result and isn't necessary, so we will run the model just with white cell count:
logistic.model <- glm(surv24 ~ wbc + ag, family = 'binomial', data = leukemia.data)
summary(logistic.model)
#>
#> Call:
#> glm(formula = surv24 ~ wbc + ag, family = "binomial", data = leukemia.data)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.0564 -0.8820 0.4460 0.8258 1.9888
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 5.7107385 2.4790234 2.304 0.02124 *
#> wbc -0.0007356 0.0002790 -2.636 0.00839 **
#> agpresent 2.1593304 0.4922130 4.387 1.15e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 138.47 on 99 degrees of freedom
#> Residual deviance: 110.56 on 97 degrees of freedom
#> AIC: 116.56
#>
#> Number of Fisher Scoring iterations: 4
To generate nice logistic lines, we need to create a dummy dataset containing samples along the range of the x values we wish to plot:
dummy_df <- data.frame(ag = rep(c("present", "absent"), each = 151),
wbc = rep(seq(0, 15000, 100), 2))
dummy_df$surv24 <- predict(logistic.model, newdata = dummy_df, type = "response")
Now we can plot. It's actually far simpler to do this with ggplot:
library(ggplot2)
ggplot(leukemia.data, aes(wbc, surv24, color = ag)) +
geom_point() +
geom_line(data = dummy_df) +
lims(x = c(0, 15000))
However, to recreate your target plot in base R graphics, you could do something like:
plot(dummy_df$wbc[1:151]/1000, dummy_df$surv24[1:151],
type = "l", col = "green", ylim = c(0, 1),
ylab = "Probability of death prior to 24 weeks",
xlab = "WBC counts", bty = "L")
lines(dummy_df$wbc[152:302]/1000, dummy_df$surv24[152:302], col = "black")
with(leukemia.data, points(wbc[ag == "present"]/1000,
surv24[ag == "present"], col = "red"))
with(leukemia.data, points(wbc[ag == "absent"]/1000,
surv24[ag == "absent"], col = "black"))
legend("topright", legend = c("absent", "present"), title = "AG test",
lty = c(1, 1), col = c("black", "green"), pch = c(1, 2), bty = "L")
Now for the main caveat: since you already have the raw survival times, you should probably run this as a survival analysis, not as logistic regression, since you have lost a lot of statistical power by converting to a binary outcome.
Upvotes: 4