LaymoO

Reputation: 107

How to build a term-document matrix of frequent terms from a set of texts

I have been using the tm package to run some text analysis. My problem is creating a term-document matrix of frequent terms so that I can build a graph. I want the graph to use only the terms that appear more than 20 times.

How can I create this matrix?

### Stage the Data      
dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs)   


### Explore your data      
freq <- colSums(as.matrix(dtm))   
length(freq)   
ord <- order(freq)   
m <- as.matrix(dtm)   
dim(m)  

write.csv(m, file="DocumentTermMatrix.csv")   
termDocMatrix <- as.matrix(tdm)
termDocMatrix

termDocMatrix must contain only the terms that appear more than 20 times.

Thank you.

Upvotes: 0

Views: 279

Answers (1)

phiver

Reputation: 23598

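You can use findFreqTerms() on the DocumentTermMatrix to find the terms in question and then subset the matrix to those terms; see the example below. Note that lowfreq is an inclusive lower bound, so lowfreq = 20 keeps terms that occur 20 times or more; use lowfreq = 21 if you want terms that occur strictly more than 20 times. After that you can do your normal matrix calculations on this subset.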

Edit based on the OP's comment: added extra lines of code to show how it works for a TermDocumentMatrix.

library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("smart"))


# Based on DocumentTermMatrix
dtm <- DocumentTermMatrix(crude)
# Filter the DocumentTermMatrix to only include terms with a frequency of 20 or more
dtm <- dtm[, findFreqTerms(dtm, lowfreq = 20)]
inspect(dtm)

<<DocumentTermMatrix (documents: 20, terms: 9)>>
Non-/sparse entries: 107/73
Sparsity           : 41%
Maximal term length: 6
Weighting          : term frequency (tf)

     Terms
Docs  bpd crude dlrs market mln oil opec prices reuter
  127   0     2    2      1   0   5    0      3      1
  144   4     0    0      3   4  12   13      5      1
  191   0     2    1      0   0   2    0      0      1
  194   0     3    2      0   0   1    0      0      1
  211   0     0    2      0   2   1    0      0      1
  236   7     2    2      0   4   7    6      5      1
  237   0     0    1      0   1   3    1      1      1
  242   0     0    0      2   0   3    2      2      1
  246   0     0    0      0   0   5    1      1      1
  248   2     0    4      8   3   9    6      9      1
  273   8     5    2      1   9   5    5      5      1
  349   0     2    0      1   0   4    2      1      1
  352   0     0    0      2   0   5    2      5      1
  353   2     2    0      0   0   4    4      2      1
  368   0     0    0      0   0   3    0      0      1
  489   0     0    1      0   3   4    0      2      1
  502   0     0    1      0   3   5    0      2      1
  543   0     2    5      0   0   3    0      2      1
  704   0     0    0      2   0   3    0      3      1
  708   0     1    0      0   2   1    0      0      1

# Based on TermDocumentMatrix
tdm <- TermDocumentMatrix(crude)
# Filter the TermDocumentMatrix to only include terms with a frequency of 20 or more
tdm <- tdm[findFreqTerms(tdm, lowfreq = 20), ]

inspect(tdm)
<<TermDocumentMatrix (terms: 9, documents: 20)>>
Non-/sparse entries: 107/73
Sparsity           : 41%
Maximal term length: 6
Weighting          : term frequency (tf)

        Docs
Terms    127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
  bpd      0   4   0   0   0   7   0   0   0   2   8   0   0   2   0   0   0   0   0   0
  crude    2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0   0   2   0   1
  dlrs     2   0   1   2   2   2   1   0   0   4   2   0   0   0   0   1   1   5   0   0
  market   1   3   0   0   0   0   0   2   0   8   1   1   2   0   0   0   0   0   2   0
  mln      0   4   0   0   2   4   1   0   0   3   9   0   0   0   0   3   3   0   0   2
  oil      5  12   2   1   1   7   3   3   5   9   5   4   5   4   3   4   5   3   3   1
  opec     0  13   0   0   0   6   1   2   1   6   5   2   2   4   0   0   0   0   0   0
  prices   3   5   0   0   0   5   1   2   1   9   5   1   5   2   0   2   2   2   3   0
  reuter   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

Upvotes: 1
