Reputation: 127
I am implementing a version of the Very Large Scale Relieff algorithm detailed here.
Simply put, Very Large Scale Relieff split the set of features N
into several random subsets Ns
where Ns << N
. Then it calculates the Relieff weights for the features in the subset Ns
. For each feature, the final weight will be the highest weight assigned among the different subsets were that particular feature appear.
I have ~80000 features for ~100 subjects. I can calculate 10000 subsets of 8000 features each in a reasonable amount of time (~5 minutes running on 25 cores) with the following code (that is scaled down to 100 features in order to be easier to profile):
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)
#create fake data for example
fake_table <- matrix(rnorm(100*100), ncol = 100) %>%
as_tibble()
outcome <- rnorm(100)
#create fake data for example
#VLSRelieff code
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
type = "FORK")
registerDoParallel(myCluster)
result <- foreach(x = seq(1,10000)) %dopar% {
#set seed for results consistency among different run
set.seed(x)
#subsample the features table by extracting a subset of columns
subset_index <- sample(seq(1,ncol(fake_table)),size = round(ncol(fake_table)*.01))
subset_matrix <- fake_table[,subset_index]
#assign the outcome as last column of the subset
subset_matrix[,ncol(subset_matrix)+1] <- outcome
#use the function attrEval from the CORElearn package to calculate the Relieff weights for the subset
rf_weights <- attrEval(formula = ncol(subset_matrix), subset_matrix, estimator = "RReliefFequalK")
#create a data frame with as many columns as features in the subset and only one row
#with the Relieff weigths
rf_df <- rf_weights %>%
unname() %>%
matrix(., ncol = length(.), byrow = TRUE) %>%
as_tibble() %>%
set_colnames(., names(rf_weights))}
end_time <- Sys.time()
end_time - start_time
However, the code above does only half of the work: the other half is, for each feature, to go into the results of the different repetitions and find the maximum value obtained. I have managed to write a working code, but it is outrageously slow (I let it run for 2 hours before stopping it, although it worked on testing with fewer features - again, here it is scaled down to 100 features and should run in ~7 seconds):
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
type = "FORK")
registerDoParallel(myCluster)
#get all features name
feat_names <- colnames(fake_table)
#initalize an empty vector of zeros, with the names of the features
feat_wegiths <- rep(0, length(feat_names))
names(feat_wegiths) <- feat_names
#loop in parallel on the features name, for each feature name
feat_weight_foreach <- foreach(feat = feat_names, .combine = 'c') %dopar% {
#initalize the weight as 0
current_weigth <- 0
#for all element in result (i.e. repetitions of the subsampling process)
for (el in 1:length(result)){
#assign new weight accessing the table
new_weigth <- result[[el]][[1,feat]]
#skip is empty (i.e. the features is not present in the current subset)
if(is_empty(new_weigth)){next}
#if new weight is higher than current weight assign the value to current weight
if (current_weigth < new_weigth){
current_weigth <- new_weigth}}
current_weigth
}
end_time <- Sys.time()
end_time - start_time
Upvotes: 2
Views: 643
Reputation: 11255
This follows @DS_UNI's idea, but instead of binding a list, the approach is to create a matrix from the initial loop. That is, a list of tibbles makes us do extra work. Instead, we have every thing we need to make a matrix:
library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)
nr = 50L
nc = 200L
## generate data
set.seed(123)
mat = matrix(rnorm(nr * nc), ncol = nc, dimnames = list(NULL, paste0('V', seq_len(nc))))
outcome = rnorm(nr)
## constants for sampling
n_reps = nc
nc_sample_size = round(nc * 0.01)
## pre-allocate result
res = matrix(0, nrow = n_reps, ncol = ncol(mat), dimnames = dimnames(mat))
st = Sys.time()
for (i in seq_len(n_reps)) {
set.seed(i)
## similar way to do data simulations as OP
sub_cols = sample(seq_len(nc), nc_sample_size)
sub_mat = cbind(mat[, sub_cols], outcome)
rf_weights = attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat), estimator = 'RReliefFequalK')
## assign back to pre-allocated result
res[i, sub_cols] = rf_weights
}
## get max of each column
apply(res, 2L, max)
et = Sys.time()
et - st
The downsides is that this loses the parallel workers. The upside is that we have less memory slowdowns because we're allocating much of what we need up front.
Upvotes: 1
Reputation: 349
This is not a final answer, but I would suggest, since it is a numerical problem, to write a function in C++. This will increase the speed significantly, by some order of magnitude I would guess. In my oppinion, using R for this very specific numercial task is just hitting a brick wall.
The first chapter of Rcpp for everyone says:
Chapter 1 Suitable situations to use Rcpp R is weak in some kinds of operations. If you need operations listed below, it is time to consider using Rcpp.
Wickham's Advanced R has a good chapter on that topic too.
Upvotes: 0
Reputation: 2650
If I understood what you are trying to do correctly, then the answer is simpler than you think.
Correct me if I'm wrong, but you are trying to get the max value obtained from attrEval per feature? if so, then why not just bind all results in one dataframe (or data.table), and then get the max per column like so:
allResults <- result %>% data.table::rbindlist(fill = TRUE)
apply(allResults, 2, max, na.rm=TRUE)
Upvotes: 2