Reputation: 107
I have been using the tm package to run some text analysis. My problem is creating a term-document matrix to build a graph from. I want the graph to include only the terms that appear more than 20 times.
How can I create this matrix?
### Stage the Data
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
### Explore your data
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
m <- as.matrix(dtm)
dim(m)
write.csv(m, file="DocumentTermMatrix.csv")
termDocMatrix <- as.matrix(tdm)
termDocMatrix
termDocMatrix must contain only the terms that appear more than 20 times.
Thank you.
Upvotes: 0
Views: 279
Reputation: 23598
You can use findFreqTerms to get the terms in question and then use that result to subset the DocumentTermMatrix. See the example below. After that you can do your normal matrix calculations on this subset.
Edit based on the OP's comment: added extra lines of code to show how it works for a TermDocumentMatrix.
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("smart"))
#Based on DocumentTermMatrix
dtm <- DocumentTermMatrix(crude)
# filter the documenttermmatrix to only include items with a frequency of 20 or more
dtm <- dtm[, findFreqTerms(dtm, lowfreq = 20)]
inspect(dtm)
<<DocumentTermMatrix (documents: 20, terms: 9)>>
Non-/sparse entries: 107/73
Sparsity : 41%
Maximal term length: 6
Weighting : term frequency (tf)
     Terms
Docs  bpd crude dlrs market mln oil opec prices reuter
  127   0     2    2      1   0   5    0      3      1
  144   4     0    0      3   4  12   13      5      1
  191   0     2    1      0   0   2    0      0      1
  194   0     3    2      0   0   1    0      0      1
  211   0     0    2      0   2   1    0      0      1
  236   7     2    2      0   4   7    6      5      1
  237   0     0    1      0   1   3    1      1      1
  242   0     0    0      2   0   3    2      2      1
  246   0     0    0      0   0   5    1      1      1
  248   2     0    4      8   3   9    6      9      1
  273   8     5    2      1   9   5    5      5      1
  349   0     2    0      1   0   4    2      1      1
  352   0     0    0      2   0   5    2      5      1
  353   2     2    0      0   0   4    4      2      1
  368   0     0    0      0   0   3    0      0      1
  489   0     0    1      0   3   4    0      2      1
  502   0     0    1      0   3   5    0      2      1
  543   0     2    5      0   0   3    0      2      1
  704   0     0    0      2   0   3    0      3      1
  708   0     1    0      0   2   1    0      0      1
#based on TermDocumentMatrix
tdm <- TermDocumentMatrix(crude)
# filter the termdocumentmatrix to only include items with a frequency of 20 or more
tdm <- tdm[findFreqTerms(tdm, lowfreq = 20), ]
inspect(tdm)
<<TermDocumentMatrix (terms: 9, documents: 20)>>
Non-/sparse entries: 107/73
Sparsity : 41%
Maximal term length: 6
Weighting : term frequency (tf)
        Docs
Terms    127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
  bpd      0   4   0   0   0   7   0   0   0   2   8   0   0   2   0   0   0   0   0   0
  crude    2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0   0   2   0   1
  dlrs     2   0   1   2   2   2   1   0   0   4   2   0   0   0   0   1   1   5   0   0
  market   1   3   0   0   0   0   0   2   0   8   1   1   2   0   0   0   0   0   2   0
  mln      0   4   0   0   2   4   1   0   0   3   9   0   0   0   0   3   3   0   0   2
  oil      5  12   2   1   1   7   3   3   5   9   5   4   5   4   3   4   5   3   3   1
  opec     0  13   0   0   0   6   1   2   1   6   5   2   2   4   0   0   0   0   0   0
  prices   3   5   0   0   0   5   1   2   1   9   5   1   5   2   0   2   2   2   3   0
  reuter   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
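Since the question asks for terms that appear more than 20 times and wants a matrix to build a graph from, here is a minimal sketch of how the filtered TermDocumentMatrix could be taken one step further. It rests on two assumptions that are not in the original post: that "more than 20" is meant strictly (lowfreq is an inclusive lower bound, so the sketch uses lowfreq = 21; going by the counts printed above, reuter and market total exactly 20 and would then drop out), and that the graph should link terms that occur in the same document. The names tdm_gt20, presence and termMatrix are only illustrative.

```r
library(tm)

# Same preprocessing as in the example above
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("smart"))

tdm <- TermDocumentMatrix(crude)

# lowfreq is inclusive, so 21 keeps terms occurring strictly more than 20 times
# (assumption: the OP wants a strict threshold)
tdm_gt20 <- tdm[findFreqTerms(tdm, lowfreq = 21), ]

# Plain matrix with terms as rows and documents as columns
termDocMatrix <- as.matrix(tdm_gt20)

# Sanity check: every remaining term has a total count above 20
stopifnot(all(rowSums(termDocMatrix) > 20))

# Presence/absence of each term in each document (1 = present, 0 = absent)
presence <- (termDocMatrix > 0) * 1

# Term-by-term matrix: entry [i, j] counts the documents containing both terms
termMatrix <- presence %*% t(presence)

# Drop self-links (assumption: a term should not be connected to itself)
diag(termMatrix) <- 0

termMatrix
```

A square matrix like termMatrix can usually be handed to a graph package as a weighted adjacency matrix, for example igraph's graph_from_adjacency_matrix with weighted = TRUE and mode = "undirected". The question does not name a graph package, so treat that as one possible choice.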
Upvotes: 1