Problem with parallel computing over an index using foreach and %dopar% in R

Question

I am trying to efficiently code the computation of the KDN complexity measure, which involves a loop over all of the rows of a distance matrix and making some computations out of it.

I am trying to parallel this code with foreach and %dopar% functions, but I am not achieving any running time reduction. I am conscious that some parallel computations are not efficient because of memory management, but I don’t know if this is my case or if I am doing something wrong.

This is a reproducible example with digits data from rsvd package:

First, I call all the necessary packages, I read the digits data and then I get some useful information.

#######################
### NEEDED PACKAGES ###
#######################

library(dplyr)
library(parallelDist)
# for Parallel Processing
library(doParallel)  
library(foreach)
# for digits data
library(rsvd)


#############
###  DATA ###
#############

data(digits)
data = as.data.frame(digits)

# Dividing on X variables and Y target
dataX = data %>%
  dplyr::select(-label)
dataY = data %>%
  mutate(label=factor(label)) %>% 
  pull(label)

## number of data
n=dim(dataX)[1]

Then, I do some necessary computations prior to the KDN loop I want to efficiently parallel.

##############################
###  PREVIOUS COMPUTATIONS ###
##############################

## number of available data in each class
n_data_classes=table(factor(dataY))

## number of data to be considered as neighbours in each class
k=0.05
k_neighbours_classes=ceiling(n_data_classes*k)

##  DISTANCE MATRIX COMPUTATION
 # this is time consuming but I'm not concerned about this
distance_matrix=as.matrix(parDist(scale(dataX)))

The KDN computation without no parallelization is the next one and it takes 12 secs.

#########################################
### COMPUTING KDN: NO PARALLELIZATION ###
#########################################

## KDN instance level computation
# inicialization of a vector to store KDN instance level values
kdn_instance=numeric(n)

system.time(

  for (ix in 1:n){
    ## Gettig the class of ix data point
    class_ix=dataY[ix]
    ## number of data to be considered as neighbours in this class
    k_value=k_neighbours_classes[class_ix]
    
    # we get the k_value nearest neighbors set of ix
    distances_ix=distance_matrix[ix,]
    distances_ix_ordered=order(distances_ix,decreasing = F)
    knn_set_ix=distances_ix_ordered[2:(k_value+1)]
    
    # Y value of the k_neighbors_set_ix
    Y_value_knn_set_ix=dataY[knn_set_ix]
    # Y value of ix data
    Y_value_ix=dataY[ix]
    
    # number of data in knn_set_ix with different Y value that ix
    knn_set_ix_different_Y_value=length(Y_value_knn_set_ix[Y_value_knn_set_ix!=Y_value_ix])
    kdn_instance[ix]=knn_set_ix_different_Y_value/k_value
  }
)

# user  system elapsed 
# 12.29    0.37   12.67 secs

My attempt to parallel that loop is the following one, using foreach and %dopar%, which takes 35 secs.

######################################
### COMPUTING KDN: PARALLELIZATION ###
######################################

## Preparing for paralleling
# number of cores to use
n.cores <- parallel::detectCores() - 1

# we define the cluster and register it so it can be used by %dopar%
my.cluster <- parallel::makeCluster(n.cores,type = "PSOCK")

# register it to be used by %dopar%
doParallel::registerDoParallel(cl = my.cluster)

## KDN instance level computation
kdn_instance= NULL

#iterator
itx <- iter(distance_matrix, by = 'row')

system.time(

  kdn_instance <- foreach(
    ix = itx,
    .combine = 'c'
  ) %dopar% {
    ## Gettig the class of ix data point
    class_ix=dataY[ix]
    ## number of data to be considered as neighbours in this class
    k_value=k_neighbours_classes[class_ix]
    
    # we get the k_value nearest neighbors set of ix
    distances_ix=distance_matrix[ix,]
    distances_ix_ordered=order(distances_ix,decreasing = F)
    knn_set_ix=distances_ix_ordered[2:(k_value+1)]
    
    # Y value of the k_neighbors_set_ix
    Y_value_knn_set_ix=dataY[knn_set_ix]
    # Y value of ix data
    Y_value_ix=dataY[ix]
    
    # number of data in knn_set_ix with different Y value that ix
    knn_set_ix_different_Y_value=length(Y_value_knn_set_ix[Y_value_knn_set_ix!=Y_value_ix])
    knn_set_ix_different_Y_value/k_value}
)
parallel::stopCluster(cl = my.cluster)
  

# user  system elapsed 
# 12.38    4.64   35.14 secs

As can be seen, the parallel computation is taking more time than the not parallelized one.

My question is: is there something wrong with the parallel processing code? Is there a better way to do it? Maybe it should be done with another package.

Description	Elapsed seconds
stime, baseline `%do%` method	21.479
`registerDoParallel(cores=1)`	20.425
`registerDoParallel(cores=2)`	11.508
`registerDoParallel(cores=3)`	11.107
`registerDoParallel(cores=4)`	9.491
`cl<-makeCluster(1); registerDoParallel(cl)`	26.265
`cl<-makeCluster(2); registerDoParallel(cl)`	14.66
`cl<-makeCluster(3); registerDoParallel(cl)`	12.827
`cl<-makeCluster(4); registerDoParallel(cl)`	11.943

Problem with parallel computing over an index using foreach and %dopar% in R

Answers (1)

Related Questions