mchangun

Reputation: 10322

Processing Large Text Files in R

I have a 6 GB data set of 6 million messages that I want to process. My goal is to create a Document Term Matrix for the data set, but I need to do some pre-processing first (strip out HTML tags, stemming, stop-word removal, etc.).

Here is how I am currently attempting to do all this:

library(data.table)
library(tm)

# Stem each word in each string and re-join the stemmed words with spaces
wordStem2 <- function(strings){
  sapply(lapply(strsplit(stripWhitespace(strings), " "), wordStem),
         function(x) paste(x, collapse = " "))
}

load("data/train.RData")
sampletrainDT <- as.data.table(train)
rm(train)
setkey(sampletrainDT, Id)

object.size(sampletrainDT) # 5,632,195,744 bytes

gc()
sampletrainDT[, Body := tolower(Body)]
object.size(sampletrainDT) # 5,631,997,712 bytes, but rsession usage is at 12 GB. gc doesn't help.
gc()
sampletrainDT[, Body := gsub("<(.|\n)*?>", " ", Body)] # remove HTML tags
sampletrainDT[, Body := gsub("\n", " ", Body)] # remove \n
sampletrainDT[, Body := removePunctuation(Body)]
sampletrainDT[, Body := removeNumbers(Body)]
sampletrainDT[, Body := removeWords(Body, stopwords("english"))]
sampletrainDT[, Body := stripWhitespace(Body)]
sampletrainDT[, Body := wordStem2(Body)]

ls() after each of these lines shows only:

ls()
[1] "sampletrainDT" "wordStem2"  

Each row of sampletrainDT is one message, and the main column is Body; the other columns hold metadata such as docid.

This runs pretty quickly (about 10 minutes) when I work with only a 10% subset of the data, but it doesn't even complete on the full data set: I run out of RAM on the HTML-stripping line sampletrainDT[, Body := gsub("<(.|\n)*?>", " ", Body)]. Running gc() between the lines doesn't seem to improve the situation.
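
One workaround I've been considering (just a sketch, I haven't verified that it actually lowers peak memory, and the batch size is arbitrary) is to apply the HTML-stripping step in row batches so gsub only ever works on a slice of the column at a time:

# Sketch only: update Body in row-index batches so gsub only copies a
# slice of the column at any one time (batch size is a guess)
batch_size <- 500000L
idx <- seq_len(nrow(sampletrainDT))
for (rows in split(idx, ceiling(idx / batch_size))) {
  sampletrainDT[rows, Body := gsub("<(.|\n)*?>", " ", Body)]
  gc()
}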

I've spent a couple of days Googling but haven't found a good solution yet, so I'm interested to hear from others with more experience in this. Here are the options I am considering:

  1. ff or bigmemory - hard to use and not suited for text
  2. Databases
  3. Read the data in chunks, process each chunk and append to a file (better suited for Python?) - see the sketch after this list
  4. PCorpus from tm library
  5. Map-reduce - done locally, but hopefully in a memory-friendly way
  6. Is R just not the tool for this?
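
To make option 3 a bit more concrete, here is roughly the kind of loop I have in mind (purely illustrative: the file paths and chunk size are made up, and it assumes the raw messages were exported to a plain-text file with one message per line, which isn't how my data is stored right now; clean_chunk just bundles the steps from above):

# Illustrative sketch of option 3: read raw messages in chunks, clean each
# chunk, and append the result to an output file (paths/sizes are made up)
clean_chunk <- function(x) {
  x <- tolower(x)
  x <- gsub("<(.|\n)*?>", " ", x)  # remove HTML tags
  x <- gsub("\n", " ", x)          # remove \n
  x <- removeNumbers(removePunctuation(x))
  x <- removeWords(x, stopwords("english"))
  wordStem2(stripWhitespace(x))
}

con_in  <- file("data/messages.txt", open = "r")
con_out <- file("data/messages_clean.txt", open = "w")
repeat {
  chunk <- readLines(con_in, n = 100000)
  if (length(chunk) == 0) break
  writeLines(clean_chunk(chunk), con_out)
}
close(con_in)
close(con_out)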

I would like to keep this running on a single machine (16 GB laptop) instead of using a big machine on EC2. 6GB of data doesn't seem insurmountable if done properly!

Upvotes: 0

Views: 1723

Answers (1)

mrip

Reputation: 15163

I'm not sure exactly what's going on, but here are some hopefully useful tips. First, this is a function that I use to monitor which objects are taking up the memory:

# Report the k largest objects in an environment, in megabytes
lsBySize <- function(k = 20, envir = globalenv()){
  z <- sapply(ls(envir = envir), function(x) object.size(get(x, envir = envir)))
  ret <- sort(z, decreasing = TRUE)
  if(k > 0)
    ret <- ret[1:min(k, length(ret))]

  as.matrix(ret)/10^6
}

Running gc() at any time will tell you how much memory is currently being used. If sum(lsBySize(length(ls()))) is not approximately equal to the amount of memory used as reported by gc(), then something strange is going on. In that case, please edit the question to show the output of running these two commands consecutively in your R session. Also, to isolate the issue it may be better to work with data.frames for now, because the internals of data.table are more complicated and opaque.
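
For example, the check would look something like this (the exact numbers will of course depend on your session):

lsBySize()                   # 20 largest objects, in MB
sum(lsBySize(length(ls())))  # total MB across all objects in the global environment
gc()                         # compare with the "used" (Mb) columns printed here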

Upvotes: 1
