Reputation: 1572
I am working with a 160 MB text file for a data mining task. When I convert the document-term matrix to a regular matrix to get the word frequencies, it demands far too much memory. Can someone please help me with this?
> dtm <- DocumentTermMatrix(clean)
> dtm
<<DocumentTermMatrix (documents: 472029, terms: 171548)>>
Non-/sparse entries: 3346670/80972284222
Sparsity : 100%
Maximal term length: 126
Weighting : term frequency (tf)
> as.matrix(dtm)
Error: cannot allocate vector of size 603.3 Gb
Upvotes: 1
Views: 935
Reputation: 10855
@Vineet, here is the math that shows why R tried to allocate 603 GB when converting the document-term matrix to a non-sparse matrix. Each cell of a numeric matrix in R consumes 8 bytes. Based on the dimensions of the document-term matrix in the question, the math looks like this:
> #
> # calculate memory consumed by matrix
> #
>
> rows <- 472029 #
> cols <- 171548
> # memory in gigabytes
> rows * cols * 8 / (1024 * 1024 * 1024)
[1] 603.3155
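To put that number in perspective, here is a rough sketch comparing the dense size with an estimate for the sparse form, using the counts printed in the question. The 16 bytes per non-zero entry (two 4-byte integer indices plus one 8-byte value) is an assumption about a triplet-style sparse layout, not a measured figure, and it ignores dimnames and other overhead.

```r
# Dimensions and non-zero count as printed by the dtm in the question
rows    <- 472029
cols    <- 171548
nonzero <- 3346670

# Dense numeric matrix: every cell costs 8 bytes
dense_gb <- rows * cols * 8 / 1024^3

# Assumed triplet layout: row index (4 bytes) + column index (4 bytes)
# + value (8 bytes) per non-zero entry; overhead is ignored
sparse_mb <- nonzero * (4 + 4 + 8) / 1024^2

cat(sprintf("dense:  %.1f GB\n", dense_gb))
cat(sprintf("sparse: %.1f MB (rough estimate)\n", sparse_mb))
```

Under these assumptions the sparse representation is on the order of tens of megabytes, which is why the matrix is fine until as.matrix() is called on it.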
If you want to calculate the word frequencies, you're better off generating 1-grams and then summarizing them into a frequency distribution.
With the quanteda package the code would look like this.
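library(quanteda)
# tokens() is the current replacement for the deprecated tokenize()
words <- tokens(...)
# 1-grams are the individual words; flatten them and tabulate
ngram1 <- unlist(tokens_ngrams(words, n = 1))
ngram1freq <- data.frame(table(ngram1))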
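The same idea (split into single words, then tabulate) can be sketched in base R without any package. The two toy documents and the simple letters-only split rule below are assumptions for illustration only; a real tokenizer handles punctuation, numbers, and Unicode far better.

```r
# Toy documents standing in for the real corpus (hypothetical data)
docs <- c("the cat sat on the mat", "the dog sat")

# Split on anything that is not a lowercase letter and drop empty strings
words <- unlist(strsplit(tolower(docs), "[^a-z]+"))
words <- words[nzchar(words)]

# Tabulate the 1-grams into a frequency table, most frequent first
freq <- as.data.frame(table(word = words),
                      responseName = "n",
                      stringsAsFactors = FALSE)
freq <- freq[order(-freq$n), ]
print(freq)
```

Nothing here ever builds a documents-by-terms matrix, so memory grows with the number of tokens rather than with rows times columns.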
regards,
Len
2017-11-24 UPDATE: Here is a complete example from the quanteda package that generates the frequency distribution from a document feature matrix using the textstat_frequency() function, as well as a barplot() for the top 20 features.
This approach does not require generating and aggregating n-grams into a frequency distribution.
library(quanteda)
myCorpus <- corpus(data_char_ukimmig2010)
system.time(theDFM <- dfm(myCorpus, tolower = TRUE,
                          remove = c(stopwords(), ",", ".", "-", "\"", "'", "(", ")", ";", ":")))
system.time(textFreq <- textstat_frequency(theDFM))
hist(textFreq$frequency,
     main = "Frequency Distribution of Words: UK 2010 Election Manifestos")
top20 <- textFreq[1:20, ]
barplot(height = top20$frequency,
        names.arg = top20$feature,
        horiz = FALSE,
        las = 2,
        main = "Top 20 Words: UK 2010 Election Manifestos")
...and the resulting barplot:
[Figure: bar plot "Top 20 Words: UK 2010 Election Manifestos"]
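The reason this works without exhausting memory is that the matrix stays sparse and the term frequencies are just its column sums. Here is a minimal sketch of that idea using the Matrix package that ships with R; the 3 x 3 toy matrix and its document and term names are made up for illustration and are not taken from the manifestos data.

```r
library(Matrix)

# Hypothetical sparse document-by-term counts: only non-zero cells are stored
m <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),          # document index of each non-zero cell
  j = c(1, 2, 1, 3, 1),          # term index of each non-zero cell
  x = c(2, 1, 1, 5, 1),          # term count in that cell
  dims = c(3, 3),
  dimnames = list(paste0("doc", 1:3), c("tax", "school", "health"))
)

# Column sums give the total frequency of each term across all documents,
# computed directly on the sparse representation
freq <- sort(colSums(m), decreasing = TRUE)
print(freq)
```

At no point is the matrix converted with as.matrix(), which is the step that triggered the 603 GB allocation in the question.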
Upvotes: 3