Reputation: 654
I'm trying to create a GloVe model with the data from the Kaggle Reddit comments challenge. I load the table, pull the body, and now I'm trying to clean the text.
I pulled a small subset (100000 titles) to experiment with, and this is what I have so far:
library(DBI)
library(RSQLite)
library(dplyr)
library(parallel)
library(progress)
library(textclean)
titles = as.character(df$body)
numcores = detectCores()
i = 1
temp = {}
out = {}
while(i <= 100000){
  temp = titles[i:(i+1000)] %>%
    mclapply(replace_emoji, mc.cores = numcores) %>%
    mclapply(replace_url, mc.cores = numcores) %>%
    mclapply(replace_contraction, mc.cores = numcores) %>%
    mclapply(gsub, pattern = "[^[:alnum:][:space:]]", replacement = "") %>%
    mclapply(replace_number, mc.cores = numcores)
  i = i+1000
  out = c(out, temp)
  print(i)
}
But it seems to get hung in random places. It doesn't cause an error, it just stops. When I look at Activity Monitor, I see the CPU usage just drop and never recover.
I don't know what I would need to provide to make this request easier to decompose, so please let me know, and I'll edit it in.
Am I using mclapply wrong?
I'm using a Mac with an i7, 16 GB of RAM, and 8 cores.
Edit: I have looked around and found answers like this and this but they did not help me. Also, it seems to work if I just use lapply.
Upvotes: 1
Views: 899
Reputation: 522
The nested loops caused the problem. One iteration of a parallel loop should not have to wait for another iteration to proceed. A deadlock occurs when the parallel loop ends up being executed sequentially.
Parallelizing work does not always improve efficiency.
Upvotes: 1
Reputation: 654
It seems to work if I don't chain the mclapply calls back to back, but instead wrap all the steps in a single function and call mclapply once.
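# Run every cleaning step in one function so mclapply is only called once.
cleaner = function(vec){
  vec %>%
    replace_emoji() %>%
    replace_url() %>%
    replace_contraction() %>%
    Num_Al_sep() %>%  # helper whose definition is not shown in this post
    gsub(pattern = "[^[:alnum:][:space:]]", replacement = "") %>%
    replace_number()
}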
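i = 1
temp = {}
out = {}
while(i <= 100000){
  # i:(i+999) selects exactly 1000 elements, so consecutive chunks do not
  # overlap; min() keeps the last chunk from running past the end of titles.
  temp = titles[i:min(i+999, length(titles))] %>%
    mclapply(cleaner)
  i = i+1000
  out = c(out, temp)
  #pb$tick()
  print(i)
}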
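A possible variation on the same idea, in case it helps anyone: instead of looping over slices by hand, split the vector into chunks once and hand the list of chunks to a single mclapply call. This is only a sketch under some assumptions. `clean_chunk` is a hypothetical base R stand-in for `cleaner` so that it runs without textclean, `texts` is toy data standing in for `titles`, and the chunk size and core count are arbitrary.

```r
library(parallel)

# Hypothetical stand-in for cleaner(): base R only, vectorized over a chunk.
clean_chunk <- function(vec) {
  vec <- gsub("[^[:alnum:][:space:]]", "", vec)  # drop punctuation
  tolower(vec)
}

# Toy input; in the post this would be `titles`.
texts <- c("Hello, World!", "It's 5 o'clock...", "R & parallel", "plain text")

# Assign each element a chunk id (1, 1, 2, 2, ...) and split on it.
chunk_size <- 2
chunk_id <- ceiling(seq_along(texts) / chunk_size)
chunks <- split(texts, chunk_id)

# mclapply relies on forking, which is not available on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One parallel call over the chunks; each worker cleans a whole chunk.
cleaned <- mclapply(chunks, clean_chunk, mc.cores = n_cores)

# Flatten back to a character vector in the original order.
result <- unlist(cleaned, use.names = FALSE)
print(result)
```

Because each worker receives a whole chunk and there is only one round of forking, this avoids both the chained mclapply calls and the manual while loop. I have not tested it on the full Reddit data.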
Upvotes: 0