Reputation: 93
I'm using R for data mining purposes. I connected it to Elasticsearch and retrieved the Complete Works of Shakespeare as a dataset.
library("elastic")
connect()
# Total number of documents in the index
maxi <- count(index = 'shakespeare')
# Fetch all documents in a single request
s <- Search(index = 'shakespeare', size = maxi)
# Keep only the dialogue field (text_entry) from each hit
dat <- s$hits$hits[[1]]$`_source`$text_entry
for (i in 2:maxi) {
  dat <- c(dat, s$hits$hits[[i]]$`_source`$text_entry)
}
rm(s)
Since I only want the dialogue, I loop over the hits to extract just that field. The object 's' is around 250 MB, while 'dat' is only 10 MB.
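As an aside, growing dat inside a for loop copies the vector on every iteration. A vectorized sketch of the same extraction (assuming every hit carries a text_entry field):

# Extract text_entry from every hit in one call instead of a loop
dat <- vapply(s$hits$hits,
              function(hit) hit$`_source`$text_entry,
              FUN.VALUE = character(1))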
After that I want to build a tf-idf matrix, but apparently I can't, since it uses too much memory (I have 4 GB of RAM). Here is my code:
library("tm")
# One document per line of dialogue
myCorpus <- Corpus(VectorSource(dat))
# Standard cleanup: lower-case, then strip numbers, punctuation and stopwords
myCorpus <- tm_map(myCorpus, content_transformer(tolower), lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeNumbers), lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removePunctuation), lazy = TRUE)
myCorpus <- tm_map(myCorpus, content_transformer(removeWords), stopwords("en"), lazy = TRUE)
# Term-document matrix with unnormalized tf-idf weighting
myTdm <- TermDocumentMatrix(myCorpus,
                            control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
myCorpus is around 400 MB.
But then I do:
> m <- as.matrix(myTdm)
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
Any ideas? Is the dataset too big for R?
EDIT:
removeSparseTerms doesn't work well: with sparse = 0.95 it leaves 0 terms.
inspect(myTdm)
<<TermDocumentMatrix (terms: 27227, documents: 111396)>>
Non-/sparse entries: 410689/3032568203
Sparsity : 100%
Maximal term length: 37
Weighting : term frequency (tf)
Upvotes: 1
Views: 675
Reputation: 1981
A term-document matrix will, in general, contain lots of zeros; many terms appear in only one document. The tm library stores term-document matrices as sparse matrices, which are a space-efficient way of storing this type of matrix: only the nonzero entries are kept. (You can read more about the storage format used by tm in the slam package's documentation for simple_triplet_matrix.)
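For example, you can check that only the nonzero cells are actually stored (a sketch, assuming the myTdm object from the question):

library(slam)
# A TermDocumentMatrix is a simple_triplet_matrix underneath:
# parallel vectors i, j, v hold only the nonzero cells
length(myTdm$v)                            # number of stored (nonzero) entries
format(object.size(myTdm), units = "MB")   # modest, since zeros are not stored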
When you try to convert this to a regular (dense) matrix, the representation is far less space-efficient and makes R run out of memory: with 27,227 terms and 111,396 documents the dense matrix would have roughly 3 billion cells, which overflows R's integer arithmetic in nr * nc (hence the NA in your error) and would need around 24 GB stored as doubles. You can use removeSparseTerms before you convert to a matrix, to try to make the full matrix small enough to work with.
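Regarding the edit: sparse = 0.95 keeps only terms that are missing from at most 95% of documents, i.e. terms appearing in at least 5% of the 111,396 lines (about 5,570 of them), which virtually no term does, so everything gets dropped. A threshold much closer to 1 should keep more terms; here is a sketch (the 0.999 value is an illustrative guess to tune, not a recommendation):

# Keep terms absent from at most 99.9% of documents,
# i.e. present in at least ~112 of the 111,396 lines
smallTdm <- removeSparseTerms(myTdm, sparse = 0.999)
smallTdm                  # check how many terms survive
m <- as.matrix(smallTdm)  # dense conversion is now feasible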
I'm pretty sure this is what is happening but it's hard to know for sure without being able to run your code on your machine.
Upvotes: 3