Reputation: 201
The findFreqTerms() command will tell me the high-frequency words and how many times they appear in the corpus. However, what I am interested in is not how many times a word appears in the corpus but rather how many documents contain it. For example, suppose I have a corpus of 10 documents and only one document contains the word "error", but that word occurs 100 times in that one document. Then findFreqTerms(dtm, lowfreq=100) will return "error" (where dtm is my document-term matrix). Similarly, using freqcy <- colSums(as.matrix(dtm)), I would find an associated frequency for "error" of 100. What I want instead is an answer of 1 - I want to know that the word "error" occurs in only one document.
I have a one-off way to do it which I think I could build code around to get what I want, but I have to think that there is already a solution for it.
Here is my current approach, using the "crude" dataset.
tdm <- DocumentTermMatrix(crude)
freq <- colSums(as.matrix(tdm))
freq[order(freq)]
This returns "oil" with a frequency of 80.
which(names(freq)=="oil")
This returns 782 and
inspect(tdm[,782])
gives
<<DocumentTermMatrix (documents: 20, terms: 1)>>
Non-/sparse entries: 20/0
Sparsity           : 0%
Maximal term length: 3
Weighting          : term frequency (tf)
Terms
Docs oil
127 5
144 11
191 2
194 1
211 1
236 7
237 3
242 3
246 4
248 9
273 5
349 3
352 5
353 4
368 3
489 4
502 4
543 2
704 3
708 1
v <- as.vector(tdm[, 782])
length(v[v > 0])
Returns 20 - the number of documents that contain the word "oil".
I could write code to loop over all terms in the tdm, store each of these lengths, and then select the ones with high counts. I was wondering if there is a better solution.
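For what it's worth, one possible vectorized sketch that avoids both the loop and densifying the whole matrix, assuming the tm DocumentTermMatrix still exposes its underlying slam triplet slots (i, j, v), is to tabulate the term index directly:

```r
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude)

# A DocumentTermMatrix stores only its nonzero cells in triplet form
# (i = document index, j = term index, v = count), so tabulating j
# counts, for each term, the number of documents with a nonzero entry:
doc_freq <- tabulate(dtm$j, nbins = ncol(dtm))
names(doc_freq) <- colnames(dtm)

# Terms appearing in the most documents:
head(sort(doc_freq, decreasing = TRUE))
```

This relies on an internal representation rather than a documented API, so the answers below are the more robust routes.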
Upvotes: 1
Views: 4214
Reputation: 14912
This is a measure called the document frequency of a term feature, which in its simplest form refers to the count of documents in which a term occurs. It is an integral part of common feature weighting schemes such as tf-idf (when inverted and log-transformed).
The quanteda package for text analysis has this built-in, if you are looking for an efficient implementation that works with the sparse document-term matrix structures. Example:
require(quanteda)
inaugCorpus
## Corpus consisting of 57 documents.
myDfm <- dfm(inaugCorpus, verbose = FALSE)
head(docfreq(myDfm))
## fellow-citizens of the senate and house
## 19 57 57 9 57 8
docfreq(myDfm)["terror"]
## terror
## 7
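As a side note on the tf-idf connection mentioned above: in recent quanteda versions, docfreq() also accepts a scheme argument, so the inverted, log-transformed variant can be obtained directly (argument names and defaults may differ across versions):

```r
# Inverse document frequency, log10(N / df) by default; the "scheme"
# argument is available in recent quanteda releases:
docfreq(myDfm, scheme = "inverse")["terror"]
```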
Upvotes: 0
Reputation: 54287
Assuming you got
library(tm)
docs <- c(doc1="Foo bar bar bar", doc2="Lorem Foo Ipsum")
then you could e.g. do
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))
rowSums(as.matrix(tdm)>0)
# bar foo ipsum lorem
# 1 2 1 1
or
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)), list(weighting=weightBin))
rowSums(as.matrix(tdm))
# bar foo ipsum lorem
# 1 2 1 1
to get the number of documents that contain each token.
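Building on that, a small helper (the name findFreqDocs is hypothetical, made up here) could mirror findFreqTerms() but threshold on document frequency instead:

```r
# Hypothetical analogue of findFreqTerms() that thresholds on document
# frequency rather than total term frequency:
findFreqDocs <- function(tdm, lowfreq = 1) {
  df <- rowSums(as.matrix(tdm) > 0)  # docs containing each term
  names(df[df >= lowfreq])
}

# With the two-document corpus above, this should return only "foo":
findFreqDocs(tdm, lowfreq = 2)
```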
Upvotes: 3