fednem

Reputation: 127

What is the fastest way to perform an exhaustive search in R?

I am implementing a version of the Very Large Scale ReliefF algorithm detailed here.

Simply put, Very Large Scale ReliefF splits the set of N features into several random subsets of size Ns, where Ns << N. It then calculates the ReliefF weights for the features in each subset. For each feature, the final weight is the highest weight assigned among the different subsets in which that particular feature appears.

I have ~80000 features for ~100 subjects. I can calculate 10000 subsets of 8000 features each in a reasonable amount of time (~5 minutes running on 25 cores) with the following code (that is scaled down to 100 features in order to be easier to profile):

library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)

#create fake data for example
fake_table <- matrix(rnorm(100*100), ncol = 100) %>%
  as_tibble()
outcome <- rnorm(100)
#create fake data for example

#VLSRelieff code
start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
result <- foreach(x = seq(1,10000)) %dopar% {
  #set seed for consistency of results among different runs
  set.seed(x)
  #subsample the features table by extracting a subset of columns
  subset_index <- sample(seq(1,ncol(fake_table)),size = round(ncol(fake_table)*.01))
  subset_matrix <- fake_table[,subset_index]
  #assign the outcome as last column of the subset
  subset_matrix[,ncol(subset_matrix)+1] <- outcome
  #use the function attrEval from the CORElearn package to calculate the ReliefF weights for the subset
  rf_weights <- attrEval(formula = ncol(subset_matrix), subset_matrix, estimator = "RReliefFequalK")
  #create a data frame with as many columns as features in the subset and only one row
  #with the ReliefF weights
  rf_df <- rf_weights %>%
    unname() %>%
    matrix(., ncol = length(.), byrow = TRUE) %>%
    as_tibble() %>%
    set_colnames(., names(rf_weights))}
end_time <- Sys.time()
end_time - start_time

However, the code above does only half of the work: the other half is, for each feature, to go through the results of the different repetitions and find the maximum value obtained. I have managed to write working code, but it is outrageously slow (I let it run for 2 hours before stopping it, although it did work when tested with fewer features; again, here it is scaled down to 100 features and should run in ~7 seconds):

start_time <- Sys.time()
myCluster <- makeCluster(25, # number of cores to use
                         type = "FORK")
registerDoParallel(myCluster)
#get all feature names
feat_names <- colnames(fake_table)
#initialize an empty vector of zeros, with the names of the features
feat_weights <- rep(0, length(feat_names))
names(feat_weights) <- feat_names
#loop in parallel over the feature names; for each feature name
feat_weight_foreach <- foreach(feat = feat_names, .combine = 'c') %dopar% {
  #initialize the weight as 0
  current_weight <- 0
  #for all elements in result (i.e. repetitions of the subsampling process)
  for (el in 1:length(result)){
    #assign the new weight by accessing the table
    new_weight <- result[[el]][[1,feat]]
    #skip if empty (i.e. the feature is not present in the current subset)
    if(is_empty(new_weight)){next}
    #if the new weight is higher than the current weight, assign it to the current weight
    if (current_weight < new_weight){
      current_weight <- new_weight}}
  current_weight
}
end_time <- Sys.time()
end_time - start_time

Upvotes: 2

Views: 643

Answers (3)

Cole

Reputation: 11255

This follows @DS_UNI's idea, but instead of binding a list, the approach is to create a matrix from the initial loop. That is, a list of tibbles makes us do extra work. Instead, we have everything we need to make a matrix:

library(tidyverse)
library(magrittr)
library(CORElearn)
library(doParallel)

nr = 50L
nc = 200L

## generate data
set.seed(123)
mat = matrix(rnorm(nr * nc), ncol = nc, dimnames = list(NULL, paste0('V', seq_len(nc))))
outcome = rnorm(nr) 

## constants for sampling
n_reps = nc
nc_sample_size = round(nc * 0.01)

## pre-allocate result
res = matrix(0, nrow = n_reps, ncol = ncol(mat), dimnames = dimnames(mat))

st = Sys.time()
for (i in seq_len(n_reps)) {
  set.seed(i) 
  
  ## similar way to do data simulations as OP
  sub_cols = sample(seq_len(nc), nc_sample_size)
  sub_mat = cbind(mat[, sub_cols], outcome)
  rf_weights = attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat), estimator = 'RReliefFequalK')
  
  ## assign back to pre-allocated result
  res[i, sub_cols] = rf_weights
}

## get max of each column
apply(res, 2L, max)
et = Sys.time()
et - st

The downside is that this loses the parallel workers. The upside is that we get fewer memory slowdowns, because we allocate much of what we need up front.
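If the parallel workers are still wanted, a possible variant (a sketch, untested) is to let each foreach iteration return a full-length named vector of weights, with zeros for the unsampled columns, and combine those rows with rbind into the same kind of matrix, reusing the objects defined above:

## Sketch only: keeps the pre-allocation idea but runs the repetitions in
## parallel; reuses mat, outcome, nc, nc_sample_size and n_reps from above.
## A FORK cluster works on Linux/macOS only (as in the question's code).
library(doParallel)
library(CORElearn)

myCluster <- makeCluster(4, type = "FORK")
registerDoParallel(myCluster)

res_par <- foreach(i = seq_len(n_reps), .combine = 'rbind') %dopar% {
  set.seed(i)
  sub_cols <- sample(seq_len(nc), nc_sample_size)
  sub_mat <- cbind(mat[, sub_cols], outcome)
  rf_weights <- attrEval(formula = ncol(sub_mat), as.data.frame(sub_mat),
                         estimator = 'RReliefFequalK')
  ## one full-length row: zeros everywhere, weights at the sampled columns
  row <- setNames(numeric(nc), colnames(mat))
  row[sub_cols] <- rf_weights
  row
}
stopCluster(myCluster)

## column-wise maximum, as before
apply(res_par, 2L, max)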

Upvotes: 1

xaviescacs

Reputation: 349

This is not a final answer, but since it is a numerical problem I would suggest writing a function in C++. This will increase the speed significantly, by some orders of magnitude I would guess. In my opinion, using R for this very specific numerical task is just hitting a brick wall.

The first chapter of Rcpp for everyone says:

Chapter 1 Suitable situations to use Rcpp

R is weak in some kinds of operations. If you need operations listed below, it is time to consider using Rcpp.

  • Loop operations in which later iterations depend on previous iterations.
  • Accessing each element of a vector/matrix.
  • Recurrent function calls within loops.
  • Changing the size of vectors dynamically.
  • Operations that need advanced data structures and algorithms.

Wickham's Advanced R has a good chapter on that topic too.
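For illustration only (not part of the original suggestion), a minimal sketch of how the final max-per-feature step could be pushed to C++ with Rcpp, assuming the per-repetition weights have already been gathered into a numeric matrix res with one row per repetition and one column per feature (NA where a feature was not sampled):

## Sketch only: column-wise maximum in C++ via Rcpp::cppFunction().
## `res` is assumed to be a numeric matrix (repetitions x features).
library(Rcpp)

cppFunction('
NumericVector col_max(NumericMatrix res) {
  int nc = res.ncol(), nr = res.nrow();
  NumericVector out(nc);
  for (int j = 0; j < nc; ++j) {
    double m = R_NegInf;
    for (int i = 0; i < nr; ++i) {
      double v = res(i, j);
      if (!NumericVector::is_na(v) && v > m) m = v;  // ignore NAs
    }
    out[j] = m;
  }
  return out;
}
')

## hypothetical usage, assuming res exists:
## feat_max <- setNames(col_max(res), colnames(res))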

Upvotes: 0

DS_UNI

Reputation: 2650

If I understood what you are trying to do correctly, then the answer is simpler than you think.

Correct me if I'm wrong, but you are trying to get the maximum value obtained from attrEval per feature? If so, why not just bind all the results into one data frame (or data.table) and then take the max per column, like so:

allResults <- result %>% data.table::rbindlist(fill = TRUE)
apply(allResults, 2, max, na.rm=TRUE)
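One small follow-up (an addition, not in the original answer): a feature that never appears in any subset has no column after rbindlist(), so if a weight for every feature is needed, the column maxima can be merged back into a zero-initialised named vector, as in the question:

## Sketch: start from zeros for all features, then overwrite with the maxima
## for the features that appeared in at least one subset.
feat_weights <- setNames(rep(0, ncol(fake_table)), colnames(fake_table))
max_per_col  <- apply(allResults, 2, max, na.rm = TRUE)
feat_weights[names(max_per_col)] <- max_per_col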

Upvotes: 2
