GabyLP

Reputation: 3781

Running a loop in parallel in R with doParallel

I have a table cluster (with more than one column):

head(cluster[,c('cuil_direccion')])
[1] "PJE INDEA 98 5                    "
[2] "PJE INDE 98 5                    "
[3] "B 34 VIV RECRE 57 00                 "
[4] "S CASA DE GO 600                  "
[5] "RCCA 958 00 o                             "
[6] "JUAN B  1900                       "

I need to run a function that, for each row, extracts the numbers and pastes them into a list. I'm using str_extract_all. Since the table is huge, I'd like to split the data and use a different core for each split. I tried:

library(foreach)
library(doParallel)
registerDoParallel(cores=detectCores(all.tests=TRUE))

crea_tabla <- function(x){
  xlst <- split(x, 1:nrow(x)) 
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    library(stringr)
    d<-data.frame(dir='a', E_numdir=1)
    j=1  
    DIR<-i$cuil_direccion[j]
    E_NUMDIR <- str_extract_all(DIR,"\\(?[0-9]+\\)?")[[1]]
    d<-rbind(d, data.frame( dir=DIR , 
                         E_numdir=toString(E_NUMDIR)))
    j=1+j    
  }
}

Then I ran:

crea_tabla(cluster)

And I get an empty result.

I'm not sure how doParallel handles data. For example, this part:

 library(stringr)
    d<-data.frame(dir='a', E_numdir=1)
    j=1  

Should I write it before or after %dopar%?

EDIT

num_cores<-detectCores(all.tests=TRUE)
registerDoParallel(cores=detectCores(all.tests=TRUE))



crea_tabla <- function(x, num_cores){
  xlst <- split(x, 1:nrow(x)) 
  j=1 
  d<-data.frame(dir='a', E_numdir=1) 
  pred <- foreach(i = seq_along(xlst), .combine = rbind) %dopar% {
  print(i*num_cores/nrow(x))
    library(stringr)
    DIR<-xlst[[i]]$cuil_direccion
    E_NUMDIR <- str_extract_all(DIR,"\\(?[0-9]+\\)?")[[1]]
    data.frame(dir=DIR , E_numdir=toString(E_NUMDIR))    
  }
  d <- rbind(d, pred)
  return(d)
}

a<-crea_tabla(cluster, num_cores)

Upvotes: 0

Views: 1748

Answers (1)

cdeterman

Reputation: 19950

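There are several things to note. First, you are right to be suspicious of where you put the initialized variables: declare them before the loop (there is no point in reloading the library on every iteration). One caveat: if your backend starts the workers as separate R processes (for example a socket cluster, which is what you get on Windows), a package loaded in the main session is not loaded on the workers, so pass .packages = 'stringr' to foreach. Second, you don't need the j variable; just seq_along your list and index into it.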

Next, regarding foreach: you have specified .combine = rbind, so there is no need to call rbind inside the loop. The value of each iteration is the last expression in the loop body, so end the body with the data.frame you want combined. If you want that first row, just rbind the result of the foreach loop to the initial data.frame. The full code further down accomplishes what it appears you are trying to do.
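To illustrate the point about the loop body, here is a minimal sketch. It runs sequentially with %do%, so no backend is required, and the toy columns id and square are made up for illustration. It contrasts a body that ends in an assignment with one that ends in the data frame itself:

```r
library(foreach)

# Body ends with an assignment, so each iteration yields the assigned
# value (2), not the data frame built on the line before it.
bad <- foreach(i = 1:3, .combine = rbind) %do% {
  d <- data.frame(id = i, square = i^2)
  j <- 1 + 1
}

# Body ends with the data frame, so .combine = rbind stacks the rows.
good <- foreach(i = 1:3, .combine = rbind) %do% {
  data.frame(id = i, square = i^2)
}

is.data.frame(bad)   # FALSE
is.data.frame(good)  # TRUE
nrow(good)           # 3
```

The same applies to the enclosing function: when its last expression is an assignment such as pred <- foreach(...), the value is returned invisibly, so calling the function at the console prints nothing. That is probably why the first attempt looked empty.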

Lastly, I assume you realize this, but make sure you set up your backend. I don't know which OS you are using, but you need to register one with a package such as doParallel, doMC or doSNOW. Without a registered backend, %dopar% runs sequentially and issues a warning.
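As a rough sketch of backend setup with doParallel: the example below creates an explicit cluster so that it can be shut down afterwards. The choice of 2 workers and the toy sqrt task are assumptions made only for illustration.

```r
library(foreach)
library(doParallel)

# Start 2 worker processes (an arbitrary number for this sketch)
cl <- makeCluster(2)
registerDoParallel(cl)

# Number of workers foreach will use with %dopar%
n_workers <- getDoParWorkers()

# A trivial parallel loop; results are concatenated with c()
res <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)

# Shut the workers down and fall back to the sequential backend
stopCluster(cl)
registerDoSEQ()
```

On Linux or macOS, registerDoParallel(cores = 2) is a shorter alternative that forks the current session instead of starting fresh processes.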

# recreate your data
cluster <- read.table(header=FALSE, text='
"PJE INDEA 98 5                    "
"PJE INDE 98 5                    "
"B 34 VIV RECRE 57 00                 "
"S CASA DE GO 600                  "
"RCCA 958 00 o                             "
"JUAN B  1900                       "
')
colnames(cluster) <- 'cuil_direccion'

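library(stringr)
library(foreach)
library(doParallel)

# register a parallel backend; 2 workers is an arbitrary choice here
registerDoParallel(cores = 2)

crea_tabla <- function(x){
    # one list element per row
    xlst <- split(x, 1:nrow(x))
    # initial row, created once outside the loop
    d <- data.frame(dir='a', E_numdir=1)
    # .packages loads stringr on workers that run as separate R processes
    pred <- foreach(i = seq_along(xlst), .combine = rbind, .packages = 'stringr') %dopar% {
        DIR <- xlst[[i]]$cuil_direccion
        E_NUMDIR <- str_extract_all(DIR, "\\(?[0-9]+\\)?")[[1]]
        # the last expression is the value of this iteration
        data.frame(dir=DIR, E_numdir=toString(E_NUMDIR))
    }
    d <- rbind(d, pred)
    return(d)
}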

crea_tabla(cluster)

                                         dir   E_numdir
1                                          a          1
2         PJE INDEA 98 5                          98, 5
3          PJE INDE 98 5                          98, 5
4      B 34 VIV RECRE 57 00                  34, 57, 00
5         S CASA DE GO 600                          600
6 RCCA 958 00 o                                 958, 00
7        JUAN B  1900                              1900
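Since the question mentions splitting the data so that each core handles one split, here is a possible variation that sends one chunk of rows per task instead of one row per task, which usually reduces scheduling overhead on large tables. This is only a sketch: the helper name crea_tabla_chunks, the n_chunks parameter and the small demo data frame are hypothetical, and it relies on str_extract_all being vectorised over its input.

```r
library(foreach)
library(doParallel)
library(stringr)

# Hypothetical helper: split the rows into n_chunks contiguous groups
# and process each group in a single foreach task.
crea_tabla_chunks <- function(x, n_chunks = 2) {
  chunk_size <- ceiling(nrow(x) / n_chunks)
  grp <- ceiling(seq_len(nrow(x)) / chunk_size)
  chunks <- split(x, grp)
  foreach(chunk = chunks, .combine = rbind, .packages = "stringr") %dopar% {
    # str_extract_all returns one character vector per input string
    nums <- str_extract_all(chunk$cuil_direccion, "\\(?[0-9]+\\)?")
    data.frame(dir = chunk$cuil_direccion,
               E_numdir = vapply(nums, toString, character(1)),
               stringsAsFactors = FALSE)
  }
}

# Demo data (made up, modelled on the addresses in the question)
demo <- data.frame(
  cuil_direccion = c("PJE INDEA 98 5", "B 34 VIV RECRE 57 00",
                     "S CASA DE GO 600", "JUAN B  1900"),
  stringsAsFactors = FALSE
)

cl <- makeCluster(2)        # 2 workers, chosen arbitrarily
registerDoParallel(cl)
out <- crea_tabla_chunks(demo, n_chunks = 2)
stopCluster(cl)
registerDoSEQ()
```

Setting n_chunks to the number of registered workers is a reasonable starting point, though the best value depends on the data and the machine.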

Upvotes: 2
