ilanman

Reputation: 838

What's the fastest way to count words in a large dataset using R?

I have an SQL database with 7 million+ records, each record containing some text. Within each record I want to perform text analysis, say count the occurrences of specific words. I've tried R's tokenize function from the openNLP package, which works great for small files, but 7 million records at 1-100 words per record is too large for R to hold in a data.frame. I thought about using R's bigmemory or ff packages, or even the mapReduce package. Do you guys have a preferred approach or package for this type of analysis?
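
For illustration, here is a toy version of the per-record count I'm after, in base R (the example strings and the word "cat" are made up; the real text lives in the database):

# split each record on whitespace and count how many tokens equal the target word
records <- c("the cat sat on the mat", "dogs and cats", "cat cat cat")
word    <- "cat"
sapply(strsplit(records, "\\s+"), function(tokens) sum(tokens == word))
# [1] 1 0 3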

Upvotes: 0

Views: 383

Answers (2)

Daniel Fischer

Reputation: 3380

On the SQL side you could also extract, for each entry, the string length, then apply a replace(" yourWord ","") (with flanking spaces...) to it, calculate the total string length again, and take the difference between the two; dividing that difference by the length of the replaced string gives the count. That should do the trick. My SQL skills are not good enough to easily present a running example here, sorry about that...
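
For illustration only, here is the same length-difference arithmetic written in R rather than SQL (the strings and the search word are invented, and the flanking-space refinement for whole-word matching is skipped):

# length before removal minus length after removal, divided by the word's
# length, equals the number of occurrences
txt  <- c("icecream is great", "I like icecream and more icecream", "no match here")
word <- "icecream"

len_before <- nchar(txt)
len_after  <- nchar(gsub(word, "", txt, fixed = TRUE))
(len_before - len_after) / nchar(word)
# [1] 1 2 0

The same three steps map onto the LENGTH/REPLACE-style string functions most SQL dialects provide.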

Upvotes: 0

Tyler Rinker

Reputation: 109964

Maybe approach it in parallel. I used parLapply because I believe it works on all three operating systems.

# count the whitespace-separated tokens in a single string
wc <- function(x) length(unlist(strsplit(x, "\\s+")))

# fake data: 100,000 short records
wordcols <- rep("I like icecream alot.", 100000)

library(parallel)
# start one worker per available core
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
# ship the function and the data to the workers
clusterExport(cl = cl, varlist = c("wc", "wordcols"), envir = environment())
# count words in each record in parallel
output <- parLapply(cl, wordcols, function(x) {
        wc(x)
    }
)
stopCluster(cl)
# total word count across all records
sum(unlist(output))

Upvotes: 1
