Reputation: 1
covariates <- c("age", "sex", "ph.karno", "ph.ecog", "wt.loss")
univ_formulas <- sapply(covariates,
function(x) as.formula(paste('Surv(time, status)~', x)))
univ_models <- lapply( univ_formulas, function(x){coxph(x, data = lung)})
# Extract data
univ_results <- lapply(univ_models,
function(x){
x <- summary(x)
p.value<-signif(x$wald["pvalue"], digits=2)
wald.test<-signif(x$wald["test"], digits=2)
beta<-signif(x$coef[1], digits=2);#coeficient beta
HR <-signif(x$coef[2], digits=2);#exp(beta)
HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
HR.confint.upper <- signif(x$conf.int[,"upper .95"],2)
HR <- paste0(HR, " (",
HR.confint.lower, "-", HR.confint.upper, ")")
res<-c(beta, HR, wald.test, p.value)
names(res)<-c("beta", "HR (95% CI for HR)", "wald.test",
"p.value")
return(res)
#return(exp(cbind(coef(x),confint(x))))
})
res <- t(as.data.frame(univ_results, check.names = FALSE))
as.data.frame(res)
Normally I use this code for univariate cox regression analysis but I have multiple genes >20000 that I want to run as independent variables in a univariate cox regression analysis and I am not sure how I can run this code without typing the individual covariates (gene names) out. All my column names for the genes begin with "ENSG..".
Is there a way to do univariate cox regression on so many genes in an efficient way please? Thanks in advance.
Upvotes: 0
Views: 734
Reputation: 43
you already did it. If you check the univ_formula (like univ_formula[1], and so on), then you can see it contains each term (covariate) independently.
For you gene data, you can modify the covariates variable and the rest is same:
covariates <- rownames(df) # of your dataframe or columnanes(df) of your df; I hope the genes should be in your column names.
#The following will create (paste) formula variable containing formula and gene name:
univ_formulas <- sapply(covariates, function(x) as.formula(paste('Surv(time, status)~', x)))
The rest is the same as you used to do.
There is one more package (geneSA; https://github.com/huynguyen250896/geneSA) that can also be used.
Upvotes: 0
Reputation: 694
There are several ways to make the list of variable names without typing them out. Probably one of the easiest is to use names()
to get all of the variable names in the data, then remove time
and status
from that list (as well as any other variables you don't want to include). For example, for the veteran
dataset:
covariates <- names(survival::veteran)
covariates # Look at which names were detected
#> [1] "trt" "celltype" "time" "status" "karno" "diagtime" "age"
#> [8] "prior"
covariates <- covariates[-which(covariates %in% c("time", "status"))]
covariates # Confirm time and status have been removed
#> [1] "trt" "celltype" "karno" "diagtime" "age" "prior"
Created on 2022-08-30 by the reprex package (v2.0.1)
You could also programmatically create a list of names. For example:
covariates <- sapply(1:10, FUN = function(x) paste0("ENSG.", x))
covariates
#> [1] "ENSG.1" "ENSG.2" "ENSG.3" "ENSG.4" "ENSG.5" "ENSG.6" "ENSG.7"
#> [8] "ENSG.8" "ENSG.9" "ENSG.10"
This approach might be better if the naming is easy to program. If the gene names are irregular it might be more difficult.
As far as efficiency, I don't think much can be done to improve the overall runtime. The bulk of the runtime is doing the actually coxph()
calculations. There are other questions/answers on the site about optimizing R code. If you want to pursue optimization I'd suggest looking through those, and then making a new question with a reproducible example if you need more help.
Upvotes: 0