Reputation: 331
I have 150 columns of scores against 1 column of label (1/0). My goal is to create 150 AUC scores.
Here is a manual example:
auc(roc(df$label, df$col1)),
auc(roc(df$label, df$col2)),
...
I can use here Map/sapply/lapply but is there any other method, or function?
Upvotes: 3
Views: 1552
Reputation: 3688
There's a function for doing that in the cutpointr package: multi_cutpointr. It also calculates cutpoints and other metrics, but you can discard those. By default it tries all columns except the response column as predictors. You can also choose whether the direction of the ROC curve (whether larger values imply the positive class or the other way around) is determined automatically, by leaving out the direction argument, or set manually.
dat <- iris[1:100, ]
library(tidyverse)
library(cutpointr)
mc <- multi_cutpointr(data = dat, class = "Species", pos_class = "versicolor",
                      silent = FALSE)
mc %>% select(variable, direction, AUC)
# A tibble: 4 x 3
  variable     direction   AUC
  <chr>        <chr>     <dbl>
1 Sepal.Length >=        0.933
2 Sepal.Width  <=        0.925
3 Petal.Length >=        1.00
4 Petal.Width  >=        1.00
By the way, the runtime shouldn't be a problem here, because calculating the ROC curve (even including a cutpoint) takes less than a second for one variable and one million observations using cutpointr or ROCR, so your task should run in about one or two minutes.
If memory is the limiting factor, parallelization will probably make that problem worse. If the above solution takes up too much memory, because it returns ROC curves for all variables before dropping those columns, you can try selecting the columns of interest right away in a call to map_df:
# 600,000 observations for 150 variables and a binary outcome
predictors <- matrix(data = rnorm(150 * 6e5), ncol = 150)
dat <- as.data.frame(cbind(y = sample(0:1, size = 6e5, replace = TRUE), predictors))
library(cutpointr)
library(tidyverse)
vars <- colnames(dat)[colnames(dat) != "y"]
result <- map_df(vars, function(coln) {
  cutpointr_(dat, x = coln, class = "y", silent = TRUE, pos_class = 1) %>%
    select(direction, AUC) %>%
    mutate(variable = coln)
})
result
# A tibble: 150 x 3
   direction   AUC variable
   <chr>     <dbl> <chr>
 1 >=        0.500 V2
 2 <=        0.501 V3
 3 >=        0.501 V4
 4 >=        0.501 V5
 5 <=        0.501 V6
 6 <=        0.500 V7
 7 <=        0.500 V8
 8 >=        0.502 V9
 9 >=        0.501 V10
10 <=        0.500 V11
# ... with 140 more rows
Upvotes: 3
Reputation: 7959
This is a bit of an XY question. What you actually want to achieve is to speed up your calculation. gfgm's answer addresses that with parallelization, but that's only one way to go.
If, as I assume, you are using library(pROC)'s roc/auc functions, you can gain even more speed by selecting the appropriate algorithm for your dataset.
pROC comes with essentially two algorithms that scale very differently depending on the characteristics of your data set. You can benchmark which one is the fastest by passing algorithm=0 to roc:
# generate some toy data
label <- rbinom(600000, 1, 0.5)
score <- rpois(600000, 10)
library(pROC)
roc(label, score, algorithm=0)
Starting benchmark of algorithms 2 and 3, 10 iterations...
expr min lq mean median uq max neval
2 2 4805.58762 5827.75410 5910.40251 6036.52975 6085.8416 6620.733 10
3 3 98.46237 99.05378 99.52434 99.12077 100.0773 101.363 10
Selecting algorithm 3.
Here we select algorithm 3, which shines when the number of thresholds remains low. But if 600000 data points take 5 minutes to compute I strongly suspect that your data is very continuous (no measurements with identical values) and that you have about as many thresholds as data points (600000). In this case you can skip directly to algorithm 2 which scales much better as the number of thresholds in the ROC curve increases.
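To get a feel for which case applies to you before benchmarking, a rough check is to count the distinct values in a score column, since that approximates the number of ROC thresholds. This is only a sketch in base R on toy data: score_continuous and the n_thr_* names are made up for illustration, and the distinct-value count is a proxy, not what pROC computes internally.

```r
# Toy data: a discrete score (few distinct values) and a continuous one.
set.seed(42)
n <- 1e5
score_discrete   <- rpois(n, 10)  # same kind of score as in the benchmark above
score_continuous <- rnorm(n)      # hypothetical continuous predictor

# The number of distinct values is a rough proxy for the number of thresholds.
n_thr_discrete   <- length(unique(score_discrete))
n_thr_continuous <- length(unique(score_continuous))

n_thr_discrete    # small: a few dozen distinct counts
n_thr_continuous  # essentially one threshold per observation
```

If the count is close to the number of observations, as for the continuous score here, algorithm 2 is likely the better choice.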
You can then run:
auc(roc(df$label, df$col1, algorithm=2))
auc(roc(df$label, df$col2, algorithm=2))
On my machine each call to roc now takes about 5 seconds, pretty independently of the number of thresholds. This way you should be done in under 15 minutes total. Unless you have 50 cores or more, this is going to be faster than just parallelizing. But of course you can do both...
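Doing both could look roughly like the sketch below, which fixes algorithm = 2 and spreads the columns over cores with mclapply. The toy data, the n_cores value and the use of quiet = TRUE (available in recent pROC versions) are assumptions for illustration. mclapply relies on forking, so the sketch falls back to a single core on Windows.

```r
library(pROC)
library(parallel)

# Toy data standing in for the real label and 150 score columns.
set.seed(1)
label  <- rbinom(1000, 1, 0.5)
scores <- as.data.frame(matrix(runif(1000 * 150), ncol = 150))

# Forking is unavailable on Windows, so use one core there (assumed setup).
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One AUC per column, forcing algorithm 2 instead of benchmarking each time.
auc_list <- mclapply(scores, function(score) {
  as.numeric(auc(roc(label, score, algorithm = 2, quiet = TRUE)))
}, mc.cores = n_cores)

# Collapse the list into a named numeric vector (names V1 ... V150).
auc_vec <- unlist(auc_list)
head(auc_vec)
```

Converting each result with as.numeric inside the worker keeps the returned objects small, which matters if memory is tight.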
Upvotes: 6
Reputation: 3647
If you want to parallelize the computations you could do it like this:
# generate some toy data
label <- rbinom(1000, 1, .5)
scores <- matrix(runif(1000*150), ncol = 150)
df <- data.frame(label, scores)
library(pROC)
library(parallel)
auc(roc(df$label, df$X1))
#> Area under the curve: 0.5103
# each element passed to the function is one score column of df
auc_res <- mclapply(df[, 2:ncol(df)], function(score) auc(roc(df$label, score)))
head(auc_res)
#> $X1
#> Area under the curve: 0.5103
#>
#> $X2
#> Area under the curve: 0.5235
#>
#> $X3
#> Area under the curve: 0.5181
#>
#> $X4
#> Area under the curve: 0.5119
#>
#> $X5
#> Area under the curve: 0.5083
#>
#> $X6
#> Area under the curve: 0.5159
Since most of the computational time seems to be spent in the call to auc(roc(...)), this should speed things up if you have a multi-core machine.
Upvotes: 4