Jonathan

Reputation: 201

R Text Mining - Word Frequency in Corpus as Number of Documents that Contain the Word

The findFreqTerms() command tells me the high-frequency words and how many times they appear in the corpus. What I am interested in, however, is not how many times a word appears in the corpus but how many documents contain it. For example, suppose I have a corpus of 10 documents and only one document contains the word "error", but that document contains "error" 100 times. Then findFreqTerms(dtm, lowfreq=100) will return "error" (where dtm is my document-term matrix), and freqcy <- colSums(as.matrix(dtm)) will give "error" an associated frequency of 100. What I want returned instead is 1: I want to know that the word "error" occurs in only one document.
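To make that concrete, here is a toy example (the documents are made up and are only for illustration):

library(tm)
docs <- c(rep("nothing here", 9), paste(rep("error", 100), collapse = " "))
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
colSums(as.matrix(dtm))["error"]    # 100: total occurrences in the corpus
sum(as.matrix(dtm)[, "error"] > 0)  # 1: number of documents containing "error"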

I have a one-off way to do it that I think I could build code around to get what I want, but I have to think there is already a solution for this.

Here is my current approach, using the "crude" dataset.

library(tm)
data("crude")
tdm <- DocumentTermMatrix(crude)   # a document-term matrix for the 20 "crude" documents
freq <- colSums(as.matrix(tdm))    # total occurrences of each term across the corpus
freq[order(freq)]

This returns "oil" with a frequency of 80.

which(names(freq)=="oil")

This returns 782 and

inspect(tdm[,782])

gives

<<DocumentTermMatrix (documents: 20, terms: 1)>>
Non-/sparse entries: 20/0
Sparsity           : 0%
Maximal term length: 3
Weighting          : term frequency (tf)

      Terms
Docs   oil
  127    5
  144   11
  191    2
  194    1
  211    1
  236    7
  237    3
  242    3
  246    4
  248    9
  273    5
  349    3
  352    5
  353    4
  368    3
  489    4
  502    4
  543    2
  704    3
  708    1

v<-as.vector(tdm[,782])
length(v[v>0])

This returns 20, the number of documents that contain the word "oil".

I could write code to loop over all the terms in the tdm, store each document count, and then select the terms with high counts, as in the sketch below. I was wondering whether there is a better solution.
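Something like this is what I have in mind (a rough sketch that generalizes the one-off step above; the cutoff of 10 is arbitrary):

doc_counts <- sapply(seq_len(ncol(tdm)),
                     function(i) sum(as.matrix(tdm[, i]) > 0))
names(doc_counts) <- colnames(tdm)
doc_counts[doc_counts >= 10]  # terms appearing in at least 10 documents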

Upvotes: 1

Views: 4214

Answers (2)

Ken Benoit

Reputation: 14912

This is a measure called the document frequency of a term feature, which in its simplest form refers to the count of documents in which a term occurs. It is an integral part of common feature weighting schemes such as tf-idf (when inverted and log-transformed).
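(For reference, the usual inverse document frequency of a term t is idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the document frequency described above.)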

The quanteda package for text analysis has this built-in, if you are looking for an efficient implementation that works with the sparse document-term matrix structures. Example:

require(quanteda)
inaugCorpus
## Corpus consisting of 57 documents.
myDfm <- dfm(inaugCorpus, verbose = FALSE)
head(docfreq(myDfm))
## fellow-citizens              of             the          senate             and           house 
##              19              57              57               9              57               8 
docfreq(myDfm)["terror"]
## terror 
##      7 
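If you then want the analogue of findFreqTerms() based on document counts rather than term counts, you could filter that vector; a sketch (the threshold of 10 is arbitrary):

df <- docfreq(myDfm)
names(df)[df >= 10]  ## features that occur in at least 10 documents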

Upvotes: 0

lukeA

Reputation: 54287

Assuming you got

library(tm)
docs <- c(doc1="Foo bar bar bar", doc2="Lorem Foo Ipsum")

then you could e.g. do

tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))
rowSums(as.matrix(tdm)>0)
# bar   foo ipsum lorem 
# 1     2     1     1 

or

tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)), list(weighting=weightBin))
rowSums(as.matrix(tdm))
# bar   foo ipsum lorem 
# 1     2     1     1 

to get the number of documents that contain each token.
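Since this uses a TermDocumentMatrix, documents are counted with rowSums(); for a DocumentTermMatrix like the one in the question, colSums() plays the same role. To keep only tokens above some document-count threshold, you could then filter the resulting vector (a sketch; the cutoff of 2 just fits this toy example):

docfreq <- rowSums(as.matrix(tdm) > 0)
names(docfreq)[docfreq >= 2]  # only "foo" occurs in both documents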

Upvotes: 3
