Reputation: 838
I have an SQL database with 7 million+ records, each record containing some text. Within each record I want to perform text analysis, say count the occurrences of specific words. I've tried R's tokenize function from the openNLP package, which works great for small files, but 7 million records times 1-100 words per record gets too large for R to hold in a data.frame. I thought about using R's bigmemory or ff packages, or even the mapReduce package. Do you guys have a preferred approach or package for this type of analysis?
Upvotes: 0
Views: 383
Reputation: 3380
On the SQL side you could also, for each entry, take the string length, then apply replace(" yourWord ", "") (with flanking spaces...) to it, calculate the total string length again, and take the difference between those two; dividing that difference by the length of the search term gives the count. My SQL skills are not good enough for me to easily present a running example here, sorry for that.
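To show what I mean, here is the same length-difference idea sketched in R (the sample text and the word "icecream" are just placeholders; on the SQL side you would use your dialect's LENGTH()/LEN() and REPLACE()):
txt  <- "I like icecream because icecream is great"
word <- "icecream"
# removing every occurrence shortens the string by nchar(word) per hit,
# so the length difference divided by nchar(word) is the occurrence count;
# the flanking-space variant above additionally avoids matches inside longer words
(nchar(txt) - nchar(gsub(word, "", txt, fixed = TRUE))) / nchar(word)   # gives 2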
Upvotes: 0
Reputation: 109964
Maybe approach it in parallel. I used parLapply because I believe it works on all three operating systems.
# word-count helper: split on whitespace and count the pieces
wc <- function(x) length(unlist(strsplit(x, "\\s+")))
# toy data standing in for the text column pulled from the database
wordcols <- rep("I like icecream alot.", 100000)
library(parallel)
# one worker per core (override by setting options(cl.cores = ...))
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
# make the helper and the data visible on the workers
clusterExport(cl = cl, varlist = c("wc", "wordcols"), envir = environment())
output <- parLapply(cl, wordcols, function(x) {
    wc(x)
})
stopCluster(cl)
# total number of words across all records
sum(unlist(output))
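A small variation if what you actually want is the count of one specific word per record rather than the total word count (the word "icecream" and the second cluster are just for illustration):
# count how many whitespace-delimited tokens equal the target word
count_word <- function(x, word) sum(strsplit(x, "\\s+")[[1]] == word)
cl2 <- makeCluster(detectCores())
hits <- parLapply(cl2, wordcols, count_word, word = "icecream")
stopCluster(cl2)
sum(unlist(hits))   # total occurrences of "icecream" across all records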
Upvotes: 1