emmarajan

Reputation: 25

Compute sensitivity, specificity, and more using multiple input variables in R

Preamble: this question is a follow-up to this discussion, which received a nice answer. I was also given extremely helpful advice here, and what I am dealing with now goes in a similar direction.

I am creating a largely automated dashboard and therefore look for ways to generalise wherever possible. Here, I have a data frame (in long format; most of the work is done with tidyverse packages) with the columns METHODEKURZ, CLASS_INT, COMORB, and VALUES.

Based on this information, I would like to obtain an output that looks like this:

METHODEKURZ  COMORB  Sensitivity  Specificity  PPV   NPV
A            COM1    0.49         0.22         0.31  0.11
B            COM1    0.31         0.22         0.22  0.49
C            COM1    0.22         0.49         0.31  0.22
D            COM1    0.49         0.22         0.31  0.11
A            COM2    0.22         0.22         0.49  0.11
B            COM2    0.49         0.22         0.31  0.22
C            COM2    0.31         0.22         0.31  0.22
D            COM2    0.31         0.22         0.31  0.49

If the goal were only to produce such an output for the variable METHODEKURZ, the approach shown here and reproduced below would be adequate and has been shown to work well:

library(tidyverse)
library(caret)  # confusionMatrix() comes from caret

my_df <- structure(
  list(
    a = c('A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D'), 
    b = c(0,0,1,1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0), 
    c = c('COM1','COM1','COM1','COM1','COM2','COM2','COM2','COM2','COM3','COM3','COM3','COM3', 'COM4','COM4','COM4','COM4','COM5','COM5','COM5','COM5'),
    d = c(1,1,0,0,0,1,0,0,1,0,1,0,0,0,1,1,1,0,1,1) 
  ), 
  .Names = c("METHODEKURZ", "CLASS_INT", "COMORB", "VALUES"), 
  row.names = c(NA, 20L), 
  class = "data.frame") %>%
  mutate(across(c(contains('VALUES')), 
                ~as.factor(.))) %>%
  mutate(across(c(contains('CLASS_INT')), 
                ~as.factor(.))) 

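# One row of metrics per METHODEKURZ level
t(sapply(sort(unique(my_df$METHODEKURZ)), function(i) {
  
  # 2x2 table: rows = prediction, columns = reference;
  # the first factor level ("0") is treated as the positive class
  q <- confusionMatrix(data      = my_df$CLASS_INT[my_df$METHODEKURZ == i],
                       reference = my_df$VALUES[my_df$METHODEKURZ == i])$table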
  
  c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
    specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
    ppv         = q[1, 1] / (q[1, 1] + q[1, 2]),
    npv         = q[2, 2] / (q[2, 2] + q[2, 1]))
}))

However, I have COMORB as an additional variable, which I would love to have taken into account. Could anybody help me modify the code so that it also groups by COMORB? I will use the output as a table but will likely also invest some time in finding a good way to visualise it. Thanks a lot in advance for all your help.

Upvotes: 1

Views: 222

Answers (1)

marcguery

Reputation: 586

Store each combination of the variables in a data frame using expand.grid, then compute the statistics from the rows that match each combination.

library(caret)

# Generate all the combinations of variables using expand.grid
var_combinations <- expand.grid("METHODEKURZ" = unique(my_df$METHODEKURZ), 
                                "COMORB" = unique(my_df$COMORB))

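# apply() passes each row as a character vector:
# i[1] is the METHODEKURZ value, i[2] is the COMORB value
cbind(var_combinations, t(apply(var_combinations, 1, function(i) {
  set_of_rows <- my_df$METHODEKURZ == i[1] & my_df$COMORB == i[2]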
  q <- confusionMatrix(data      = my_df$CLASS_INT[set_of_rows],
                       reference = my_df$VALUES[set_of_rows])$table
  
  c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
    specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
    ppv         = q[1, 1] / (q[1, 1] + q[1, 2]),
    npv         = q[2, 2] / (q[2, 2] + q[2, 1]))
})))

#   METHODEKURZ COMORB sensitivity specificity       ppv       npv
#1            A   COM1   1.0000000   0.6666667 0.6666667 1.0000000
#2            B   COM1   1.0000000   0.2500000 0.2500000 1.0000000
#3            C   COM1   0.3333333   0.5000000 0.5000000 0.3333333
#4            D   COM1   0.0000000   0.3333333 0.0000000 0.3333333
#5            A   COM2   1.0000000   0.0000000 0.6000000       NaN
#6            B   COM2   0.0000000   0.5000000 0.0000000 0.6666667
#7            C   COM2   1.0000000   0.5000000 0.3333333 1.0000000
#8            D   COM2   0.2500000   0.0000000 0.5000000 0.0000000
#9            A   COM3   0.5000000   0.0000000 0.2500000 0.0000000
#10           B   COM3   1.0000000   0.2500000 0.2500000 1.0000000
#11           C   COM3   0.3333333   0.5000000 0.5000000 0.3333333
#12           D   COM3   0.5000000   0.0000000 0.6666667 0.0000000
#13           A   COM4   0.6666667   0.0000000 0.5000000 0.0000000
#14           B   COM4   1.0000000   0.5000000 0.3333333 1.0000000
#15           C   COM4   1.0000000   1.0000000 1.0000000 1.0000000
#16           D   COM4   0.5000000   0.3333333 0.3333333 0.5000000
#17           A   COM5   0.5000000   1.0000000 1.0000000 0.3333333
#18           B   COM5   0.0000000   0.7500000 0.0000000 0.7500000
#19           C   COM5   1.0000000   0.6666667 0.6666667 1.0000000
#20           D   COM5   0.5000000   0.0000000 0.6666667 0.0000000
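
As a side note, the same grouping idea can be sketched without caret, which may help if a combination has no rows or only one class present. The following is only a minimal sketch under stated assumptions: `metrics_from_table` is a hypothetical helper (not a caret function), the `toy` data frame is made up for illustration, and the first factor level ("0") is taken as the positive class, matching the indexing used above.

```r
# Hypothetical helper (not part of caret): the four metrics from a 2x2 table
# laid out like confusionMatrix()$table, i.e. rows = prediction,
# columns = reference, with the first factor level treated as "positive".
metrics_from_table <- function(q) {
  c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
    specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
    ppv         = q[1, 1] / (q[1, 1] + q[1, 2]),
    npv         = q[2, 2] / (q[2, 2] + q[2, 1]))
}

# Small illustrative data set (made up for this sketch)
toy <- data.frame(
  METHODEKURZ = rep(c("A", "B"), each = 8),
  COMORB      = rep(rep(c("COM1", "COM2"), each = 4), times = 2),
  CLASS_INT   = factor(c(0,0,1,1, 0,1,1,1, 0,0,0,1, 1,1,0,0), levels = c(0, 1)),
  VALUES      = factor(c(0,1,1,1, 0,0,1,1, 0,0,1,1, 1,0,0,0), levels = c(0, 1))
)

# Split by both grouping variables; drop = TRUE skips combinations with no rows
groups <- split(toy, list(toy$METHODEKURZ, toy$COMORB), drop = TRUE)

result <- do.call(rbind, lapply(groups, function(g) {
  # Both columns are factors with fixed levels, so table() is always 2x2,
  # even if a group contains only one of the two classes
  q <- table(Prediction = g$CLASS_INT, Reference = g$VALUES)
  data.frame(METHODEKURZ = g$METHODEKURZ[1],
             COMORB      = g$COMORB[1],
             t(metrics_from_table(q)))
}))
rownames(result) <- NULL
print(result)
```

Fixing the factor levels up front is the main design choice here: it keeps the `q[1, 1]`-style indexing valid for every group, and a zero denominator simply shows up as `NaN` in the result instead of raising an error.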

Raw data

I generated more values to get several observations for each combination of variables.

library(dplyr)

#For reproducibility
set.seed(123)

my_df <- structure(
  list(
    a = rep(c('A','B','C','D'),length.out = 100), 
    b = sample(c(0,1),100, replace = TRUE), 
    c = c(rep('COM1',20),rep('COM2',20),rep('COM3',20),rep('COM4',20), rep('COM5',20)),
    d = sample(c(0,1),100, replace = TRUE)
  ), 
  .Names = c("METHODEKURZ", "CLASS_INT", "COMORB", "VALUES"), 
  row.names = c(NA, 100L), 
  class = "data.frame") %>%
  mutate(across(c(contains('VALUES')), 
                ~as.factor(.))) %>%
  mutate(across(c(contains('CLASS_INT')), 
                ~as.factor(.))) 
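
To check how many observations each combination actually has, a quick cross-tabulation of the two grouping columns can help. This is just a sketch that rebuilds the grouping columns of the generated data; the threshold `min_n` is an arbitrary assumption for illustration, not a recommendation.

```r
# Same grouping columns as in the generated data above (100 rows)
METHODEKURZ <- rep(c("A", "B", "C", "D"), length.out = 100)
COMORB      <- rep(paste0("COM", 1:5), each = 20)

# Cross-tabulate the two grouping variables: one cell per combination
counts <- table(METHODEKURZ, COMORB)
print(counts)

# Flag combinations that are too small to give stable estimates
# (the threshold of 5 is an arbitrary choice for this sketch)
min_n <- 5
too_small <- which(counts < min_n, arr.ind = TRUE)
print(nrow(too_small))
```

With the original 20-row data frame from the question, every combination of METHODEKURZ and COMORB has exactly one row, which is why more values were generated here.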

Upvotes: 1
