Travasaurus

Reputation: 654

mclapply hangs when using multiple instances back to back

I'm trying to create a GloVe model with the data from the Kaggle Reddit comments challenge. I load the table, pull the body column, and now I'm trying to clean the text.

I pulled a small subset (100000 titles) to experiment with, and this is what I have so far:

library(DBI)
require(RSQLite)
library(dplyr)
library(parallel)
library(progress)
library(textclean)

titles = as.character(df$body)
numcores = detectCores()

i = 1
temp = {}
out = {}
while(i <= 100000){
  temp = titles[i:(i+999)] %>%
    mclapply(replace_emoji, mc.cores = numcores) %>%
    mclapply(replace_url, mc.cores = numcores) %>%
    mclapply(replace_contraction, mc.cores = numcores) %>%
    mclapply(gsub, pattern = "[^[:alnum:][:space:]]",replacement = "") %>% 
    mclapply(replace_number, mc.cores = numcores) 
  i = i+1000
  out = c(out, temp)
  print(i)
}

But it seems to get hung in random places. It doesn't raise an error, it just stops. When I look at my activity monitor, I see the CPU usage drop and never recover.

I don't know what I would need to provide to make this request easier to decompose, so please let me know, and I'll edit it in.

Am I using mclapply wrong?

I'm using a Mac with 16 GB of RAM and an i7 with 8 cores.

Edit: I have looked around and found answers like this and this but they did not help me. Also, it seems to work if I just use lapply.

Upvotes: 1

Views: 899

Answers (2)

Sang won kim

Reputation: 522

The nested loops caused the problem. One iteration of a parallel loop should not have to wait for another iteration before it can proceed. A deadlock occurs when the parallel loop ends up being repeated sequentially.

Parallel execution does not always improve efficiency.
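
As a rough illustration of that last point, here is a minimal sketch that runs the same cheap task with lapply and with mclapply and prints the elapsed time of each. The input vector and the use of toupper are made up for the example, and the timings will vary by machine; for very small tasks the cost of forking workers can outweigh any gain.

```r
library(parallel)

# Made-up input: many very cheap tasks.
x <- as.character(1:2000)

# Forking is unavailable on Windows, where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Time the sequential and the forked version of the same work.
t_seq <- system.time(res_seq <- lapply(x, toupper))
t_par <- system.time(res_par <- mclapply(x, toupper, mc.cores = n_cores))

# Both return the same list; only the elapsed time differs.
print(identical(res_seq, res_par))
print(t_seq["elapsed"])
print(t_par["elapsed"])
```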

Upvotes: 1

Travasaurus

Reputation: 654

It seems to work if I don't chain several mclapply calls back to back, but instead wrap all the cleaning steps in one function and call mclapply once.

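# Wrap every cleaning step in one function so each title makes a single pass.
cleaner = function(vec){
  vec %>%
    replace_emoji() %>%
    replace_url() %>%
    replace_contraction() %>%
    Num_Al_sep() %>%  # helper that is not defined in this post
    gsub(pattern = "[^[:alnum:][:space:]]", replacement = "") %>%
    replace_number()
}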

i = 1
temp = {}
out = {}
while(i <= 100000){
  # i:(i+999) takes exactly 1000 titles, so batches do not overlap
  temp = titles[i:(i+999)] %>%
    mclapply(cleaner)
  i = i+1000
  out = c(out, temp)
  #pb$tick()
  print(i)
}
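
A variation on the same idea, as a minimal sketch rather than a tested fix: split the vector into chunks up front and make a single mclapply call over the chunks, so workers are forked once instead of once per loop iteration. This assumes the cleaning steps are vectorized over a character vector. The clean_chunk function, the sample titles, and the chunk size below are hypothetical stand-ins that use only base R so the sketch runs without textclean.

```r
library(parallel)

# Hypothetical stand-in for the cleaner above; swap in the real steps.
clean_chunk <- function(vec) {
  vec <- gsub("[^[:alnum:][:space:]]", "", vec)  # drop punctuation
  tolower(vec)
}

# Made-up sample data standing in for df$body.
titles <- c("Hello, World!", "It's 5 o'clock...", "R & parallel?", "plain text")

chunk_size <- 2  # something like 1000 for the real data
# Assign each element to a chunk: 1, 1, 2, 2, ...
chunk_id <- ceiling(seq_along(titles) / chunk_size)
chunks <- split(titles, chunk_id)

# Forking is unavailable on Windows, where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One parallel call: each worker cleans a whole chunk at once.
cleaned <- mclapply(chunks, clean_chunk, mc.cores = n_cores)

# Flatten back to a character vector in the original order.
out <- unlist(cleaned, use.names = FALSE)
print(out)
```

Working on chunks instead of single elements also cuts the number of tasks handed to the workers, which may help when each element is cheap to clean.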

Upvotes: 0
