Chas Nelson

Reputation: 376

R tm package stemCompletion 'Out of Memory'

I have been trying to work through the following tutorial: http://www.rdatamining.com/examples/text-mining. However, instead of the Twitter data I have been using a .csv file (unfortunately the contents are sensitive and cannot be made public).

The .csv file has two columns: a user key in column A and a piece of narrative text (Response) in column B. The file is opened with the following code:

Data <- read.csv(file="PATH/FILE.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
Data <- Data[!(Data$Response==""), ]
df<- do.call("rbind", lapply(Data$Response, as.list))

df is a 'list of 91' with each item in the list being of type "character".
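
For reference, here is a minimal sketch of the loading step that uses inline text in place of the real file. The column names Key and Response and the sample rows are assumptions made for illustration. It also suggests that Data$Response is already a plain character vector after filtering, so building a list first may not be necessary.

```r
# Inline stand-in for PATH/FILE.csv (made-up rows, assumed column names).
csv_text <- "Key,Response\n1,First narrative\n2,\n3,Third narrative\n"

Data <- read.csv(text = csv_text, header = TRUE, sep = ",",
                 stringsAsFactors = FALSE)

# Drop rows whose Response is empty (or missing).
Data <- Data[!is.na(Data$Response) & Data$Response != "", ]

# The remaining responses form an ordinary character vector.
responses <- Data$Response
```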

From the line library(tm) onward I follow the tutorial exactly, with one exception: I add NarrativeCorpus <- tm_map(NarrativeCorpus, PlainTextDocument) after myCorpus <- tm_map(myCorpus, removeWords, myStopwords), which I found was needed for stemming.

The code fails at stem completion: myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus) with the error,

Error in grep(sprintf("^%s", w), dictionary, value = TRUE) : invalid regular expression, reason 'Out of memory'
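
The call shown in the error message pastes each stem w into a regular expression without escaping it. The base R sketch below illustrates that mechanism with a made-up dictionary. match_regex and match_literal are hypothetical helpers, not part of tm, and the example triggers a different 'invalid regular expression' reason than 'Out of memory', so treat it as an illustration of the failure mode rather than a reproduction.

```r
# Made-up dictionary for illustration only.
dictionary <- c("mining", "miners", "text", "(note")

# Same pattern construction as in the error message: the stem is
# inserted into the regular expression unescaped.
match_regex <- function(w, dictionary) {
  grep(sprintf("^%s", w), dictionary, value = TRUE)
}

# Hypothetical alternative: compare literal prefixes, so no regular
# expression is compiled at all.
match_literal <- function(w, dictionary) {
  dictionary[substr(dictionary, 1, nchar(w)) == w]
}

ok <- match_regex("min", dictionary)  # "mining" "miners"

# A stem with an unbalanced parenthesis is not a valid pattern, so
# grep() raises an error; the message is captured here instead.
result <- tryCatch(match_regex("(no", dictionary),
                   error = function(e) conditionMessage(e))

# The literal comparison handles the same stem without error.
literal <- match_literal("(no", dictionary)
```

If whole documents rather than single words reach this call, the resulting pattern can also become very long, which is one plausible (unconfirmed) contributor to the reported failure.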

I have searched online and on Stack Overflow with little luck.

I have tried converting the reference dictionary into a list of unique words then back into a corpus (to reduce its size) but to no avail.

I am using R 64-bit 3.2.3 with RStudio Desktop 0.99.891 on a Windows 7 laptop with 4GB RAM. All packages are up to date (according to RStudio).

This is my first SO post, so I welcome advice on what I should have included and why, etc.

Upvotes: 1

Views: 1663

Answers (1)

Habib Karbasian

Reputation: 666

I had a similar issue (Error in grep(sprintf("^%s", w), dictionary, value = TRUE) : invalid regular expression). After searching SO, I found the solution in this thread, which in turn came from this website.

This code should be added after loading your corpus:

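# Wrap iconv() with tm's content_transformer() instead of overwriting that name.
# 'UTF-8-MAC' is generally only available on macOS; 'UTF-8' is the usual choice elsewhere (e.g. Windows).
toUTF8 <- content_transformer(function(x) iconv(x, to='UTF-8', sub='byte'))
myCorpus <- tm_map(myCorpus, toUTF8)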
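
To see what sub='byte' does, here is a small base R sketch. It assumes the text is meant to be UTF-8 and contains a stray Latin-1 byte; the sample string is made up, and the explicit from argument is added only so the example behaves the same regardless of locale.

```r
# "\xe9" is a single Latin-1 byte (e with acute accent); on its own it
# is not valid UTF-8.
x <- "caf\xe9"

# Treat the input as UTF-8 and convert to UTF-8. Bytes that cannot be
# converted are replaced by their hex code in angle brackets.
cleaned <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
cleaned  # "caf<e9>"
```

The result contains only ASCII characters, so later pattern matching no longer has to deal with the invalid byte.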

Good luck

Upvotes: 0
