Reputation: 642
I want to convert a cox table to forest plot as showed below. Unforunatly I’ve lost my original data (coxph object) so I have to use the data from the table. Data below are just examples:
Desired output:
Reprex for the two tables:
GRP1<-tibble::tribble(
~Variable, ~Level, ~Number, ~`HR.(univariable)`, ~`HR.(multivariable)`,
"Sex", "Female", "2204 (100.0)", NA, NA,
NA, "Male", "2318 (100.0)", "1.13 (0.91-1.40, p=0.265)", "1.13 (0.91-1.40, p=0.276)",
"Score", "1", "2401 (100.0)", NA, NA,
NA, "1-2", "1637 (100.0)", "1.49 (1.19-1.87, p=0.001)", "1.15 (0.90-1.47, p=0.250)",
NA, "3-4", "412 (100.0)", "1.71 (1.14-2.56, p=0.010)", "1.09 (0.71-1.67, p=0.710)",
NA, ">=5", "42 (100.0)", "1.67 (0.53-5.21, p=0.381)", "0.96 (0.30-3.05, p=0.943)",
"Treatment", "A", "1572 (100.0)", NA, NA,
NA, "B", "2951 (100.0)", "1.74 (1.26-2.40, p=0.001)", "1.53 (1.09-2.13, p=0.013)"
)
GRP2<-tibble::tribble(
~Variable, ~Level, ~Number, ~`HR.(univariable)`, ~`HR.(univariable)`,
"Sex", "Female", "2204 (100.0)", NA, NA,
NA, "Male", "2318 (100.0)", "1.70 (1.36-2.13, p<0.001)", "1.62 (1.28-2.04, p<0.001)",
"Score", "1", "2401 (100.0)", NA, NA,
NA, "1-2", "1637 (100.0)", "2.76 (1.21-6.29, p=0.016)", "2.69 (1.18-6.13, p=0.019)",
NA, "3-4", "412 (100.0)", "5.11 (2.26-11.58, p<0.001)", "4.46 (1.95-10.23, p<0.001)",
NA, ">=5", "42 (100.0)", "5.05 (2.19-11.64, p<0.001)", "4.08 (1.73-9.59, p=0.001)",
"Treatment", "A", "1572 (100.0)", NA, NA,
NA, "B", "2951 (100.0)", "1.48 (1.16-1.88, p=0.001)", "1.23 (0.95-1.59, p=0.114)"
)
Is it doable?
Best regards, H
Upvotes: 0
Views: 1719
Reputation: 174641
The difficult thing about this task is not making the plot; it is converting your data from a bunch of text strings into a single long-format data frame that can be used for plotting. This involves using regular expressions to capture the appropriate number for each column, pivoting the result, then repeating that process for the second data frame before binding the two frames together. This is unavoidably ugly and complicated, but that is one of the reasons why having data stored in the correct format is so important.
Anyway, the following code performs the necessary operations:
library(dplyr)
wrangler <- function(data){
grp <- as.character(match.call()$data)
data %>%
tidyr::fill(Variable) %>%
mutate(Variable = paste(Variable, Level),
Number = as.numeric(gsub("^(\\d+).*$", "\\1", Number)),
univariable_HR = as.numeric(gsub("^((\\d+|\\.)+).*$", "\\1", `HR.(univariable)`)),
univariable_lower = as.numeric(gsub("^.+? \\((.+?)-.*$", "\\1", `HR.(univariable)`)),
univariable_upper = as.numeric(gsub("^.+?-(.+?),.*$", "\\1", `HR.(univariable)`)),
univariable_p = gsub("^.+?p=*(.+?)\\).*$", "\\1", `HR.(univariable)`),
multivariable_HR = as.numeric(gsub("^((\\d+|\\.)+).*$", "\\1", `HR.(multivariable)`)),
multivariable_lower = as.numeric(gsub("^.+? \\((.+?)-.*$", "\\1", `HR.(multivariable)`)),
multivariable_upper = as.numeric(gsub("^.+?-(.+?),.*$", "\\1", `HR.(multivariable)`)),
multivariable_p = gsub("^.+?p=*(.+?)\\).*$", "\\1", `HR.(multivariable)`),
group = grp) %>%
filter(!is.na(univariable_HR)) %>%
select(-Level, -`HR.(multivariable)`, - `HR.(univariable)`) %>%
tidyr::pivot_longer(cols = -(c(1:2, 11)), names_sep = "_", names_to = c("type", ".value"))
}
df <- rbind(wrangler(GRP1), wrangler(GRP2))
This now gives us the data in the correct format for plotting. Each row will become a single pointrange in our plot, so it needs a hazard ratio, a lower confidence bound, an upper confidence bound, a variable label, the type (multivariable versus univariable), and the group it originally came from (GRP1 or GRP2):
df
#> # A tibble: 20 x 8
#> Variable Number group type HR lower upper p
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 Sex Male 2318 GRP1 univariable 1.13 0.91 1.4 0.265
#> 2 Sex Male 2318 GRP1 multivariable 1.13 0.91 1.4 0.276
#> 3 Score 1-2 1637 GRP1 univariable 1.49 1.19 1.87 0.001
#> 4 Score 1-2 1637 GRP1 multivariable 1.15 0.9 1.47 0.250
#> 5 Score 3-4 412 GRP1 univariable 1.71 1.14 2.56 0.010
#> 6 Score 3-4 412 GRP1 multivariable 1.09 0.71 1.67 0.710
#> 7 Score >=5 42 GRP1 univariable 1.67 0.53 5.21 0.381
#> 8 Score >=5 42 GRP1 multivariable 0.96 0.3 3.05 0.943
#> 9 Treatment B 2951 GRP1 univariable 1.74 1.26 2.4 0.001
#> 10 Treatment B 2951 GRP1 multivariable 1.53 1.09 2.13 0.013
#> 11 Sex Male 2318 GRP2 univariable 1.7 1.36 2.13 <0.001
#> 12 Sex Male 2318 GRP2 multivariable 1.62 1.28 2.04 <0.001
#> 13 Score 1-2 1637 GRP2 univariable 2.76 1.21 6.29 0.016
#> 14 Score 1-2 1637 GRP2 multivariable 2.69 1.18 6.13 0.019
#> 15 Score 3-4 412 GRP2 univariable 5.11 2.26 11.6 <0.001
#> 16 Score 3-4 412 GRP2 multivariable 4.46 1.95 10.2 <0.001
#> 17 Score >=5 42 GRP2 univariable 5.05 2.19 11.6 <0.001
#> 18 Score >=5 42 GRP2 multivariable 4.08 1.73 9.59 0.001
#> 19 Treatment B 2951 GRP2 univariable 1.48 1.16 1.88 0.001
#> 20 Treatment B 2951 GRP2 multivariable 1.23 0.95 1.59 0.114
Now that we have the data in this format, the plot itself is straightforward:
library(ggplot2)
ggplot(df, aes(HR, Variable)) +
geom_pointrange(aes(xmin = lower, xmax = upper, colour = type),
position = position_dodge(width = 0.5)) +
facet_grid(group~., switch = "y") +
geom_vline(xintercept = 0, linetype = 2) +
theme_bw() +
theme(strip.placement = "outside",
strip.text= element_text(angle = 180),
strip.background = element_blank(),
panel.spacing = unit(0, "mm"))
Created on 2021-11-01 by the reprex package (v2.0.0)
Upvotes: 4