Reputation: 115
I am working on a text mining project with R. The file size is over 100 MB. I managed to read the file and did some text processing; however, when I get to the point of removing stop words, RStudio crashes. What would be the best solution, please?
Should I split the file into 2 or 3 files, process them, and then merge them again before applying any analytics? Does anyone have code to split the file? I tried several options available online and none of them seemed to work.
Here is the code I used. Everything worked smoothly except removing the stop words:
# Install
install.packages("tm") # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library(readr)
doc <- read_csv(file.choose())
docs <- Corpus(VectorSource(doc))
docs
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
Upvotes: 0
Views: 660
Reputation: 23608
If you have a lot of words in the corpus, R will take a long time removing stopwords. tm's removeWords is basically one giant gsub and works like this:
gsub(sprintf("(*UCP)\\b(%s)\\b",
             paste(sort(words, decreasing = TRUE), collapse = "|")),
     "", x, perl = TRUE)
Because every word (x) in the corpus is checked against the stopwords, and a 100 MB file contains a lot of words, RStudio might crash because it doesn't receive a response from R for a while. I'm not sure if there is a timeout built into RStudio somewhere.
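As a minimal sketch of what that means in practice (the example string is made up), removeWords does nothing more than that gsub:

library(tm)

x <- "this is an example sentence about text mining"
words <- stopwords("english")

removeWords(x, words)
# roughly: "   example sentence  text mining"

# equivalent to the gsub shown above
gsub(sprintf("(*UCP)\\b(%s)\\b",
             paste(sort(words, decreasing = TRUE), collapse = "|")),
     "", x, perl = TRUE)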
Now you could run this code in the R console; this shouldn't crash, but you might wait a long time. You could use the beepr package to play a sound when the process is done.
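For example, a minimal sketch assuming docs is the corpus built in the question:

library(tm)
library(beepr)

# run the slow step, then get an audible notification when it finishes
docs <- tm_map(docs, removeWords, stopwords("english"))
beep()  # plays a short notification sound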
If possible, my advice would be to switch to the quanteda package, as it runs in parallel out of the box, is better documented and supported, and has fewer UTF-8 issues than tm. At least, that is my experience.
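A minimal sketch of the same pre-processing in quanteda, assuming the text lives in a character column called "text" (adjust the column name to your file):

library(readr)
library(quanteda)

doc <- read_csv(file.choose())

# build a corpus from the text column, then tokenize
corp <- corpus(doc, text_field = "text")   # "text" is an assumed column name
toks <- tokens(corp, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))

# sparse document-feature matrix for further analysis
dfmat <- dfm(toks)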
But you could also try to run your tm code in parallel, as in the code below, and see if that works a bit better:
library(tm)
# your code reading in files
library(parallel)
cores <- detectCores()
# use cores - 1 if you want to keep using the machine while the code is running
cl <- makeCluster(cores)
tm_parLapply_engine(cl)
docs <- Corpus(VectorSource(doc))
# Convert the text to lower case, remove numbers and stopwords
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
# rest of tm code if needed
tm_parLapply_engine(NULL)
stopCluster(cl)
If you are going to do calculations on the big document-term matrix that you will get with this many words, make sure you use functions from the slam package (installed along with tm). These functions keep the document-term matrix in a sparse form; otherwise it might be converted into a dense matrix and your R session will crash because of too much memory consumption.
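For example, a minimal sketch assuming docs is the corpus from above: slam's col_sums()/row_sums() operate on the sparse matrix directly, whereas as.matrix() first expands it into a dense matrix.

library(tm)
library(slam)

dtm <- DocumentTermMatrix(docs)    # stored as a sparse simple_triplet_matrix

# sparse-aware and memory-friendly
term_freq <- col_sums(dtm)
doc_len   <- row_sums(dtm)

# dense and risky for a 100 MB corpus:
# term_freq <- colSums(as.matrix(dtm))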
Upvotes: 1