user3316599

Reputation: 63

Filter rows/documents from Document-Term-Matrix in R

Using the tm-package in R I create a Document-Term-Matrix:

dtm <- DocumentTermMatrix(cor, control = list(dictionary=c("someTerm")))

Which results in something like this:

A document-term matrix (291 documents, 1 terms)

Non-/sparse entries: 48/243
Sparsity           : 84%
Maximal term length: 8 
Weighting          : term frequency (tf) 

                   Terms
Docs                someTerm
doc1                       0
doc2                       0
doc3                       7
doc4                      22
doc5                       0

Now I would like to filter this Document-Term-Matrix by the number of occurrences of someTerm in each document, e.g. keep only the documents where someTerm appears at least once (doc3 and doc4 here).

How can I achieve this?

Upvotes: 6

Views: 7356

Answers (2)

ElenaZhebel

Reputation: 11

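Alternatively, you could use the removeSparseTerms function, which removes sparse terms, i.e. terms that are absent from most documents (see the tm documentation). Note that it drops terms (columns), not documents (rows).

dtm <- removeSparseTerms(dtm, 0.1) # keeps only terms that are missing from fewer than 10% of the documents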
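To make the behaviour of removeSparseTerms concrete, here is a minimal sketch on a made-up four-document corpus (the data frame `df` and its contents are assumptions for illustration, not the asker's data). It suggests that the number of documents stays the same and only the term columns shrink:

```r
library(tm)

# Hypothetical toy corpus; DataframeSource expects columns doc_id and text
df <- data.frame(
  doc_id = paste0("doc", 1:4),
  text   = c("alpha beta", "alpha gamma", "alpha delta", "alpha beta"),
  stringsAsFactors = FALSE
)
dtm <- DocumentTermMatrix(VCorpus(DataframeSource(df)))

# Drop terms that are missing from 60% or more of the documents
dtm_small <- removeSparseTerms(dtm, 0.6)

dim(dtm)        # documents x terms before
dim(dtm_small)  # same number of documents, fewer terms
```

So on its own this does not select documents by the count of a given term; it only thins out the vocabulary.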

Upvotes: 1

James King

Reputation: 6375

It's very similar to how you would subset a regular R matrix. For example, to create a document term matrix from the example Reuters dataset with only the rows where the term "would" appears more than once (use >= 1 instead of > 1 for "at least once"):

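# Example Reuters articles shipped with the tm package
reut21578 <- system.file("texts", "crude", package = "tm")

reuters <- VCorpus(DirSource(reut21578),
    readerControl = list(reader = readReut21578XMLasPlain))

dtm <- DocumentTermMatrix(reuters)
# Logical vector: TRUE for documents in which "would" occurs more than once
v <- as.vector(dtm[, "would"] > 1)
# Keep only those documents (rows)
dtm2 <- dtm[v, ]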

> inspect(dtm2[, "would"])
A document-term matrix (3 documents, 1 terms)

Non-/sparse entries: 3/0
Sparsity           : 0%
Maximal term length: 5 
Weighting          : term frequency (tf)

     Terms
Docs  would
  246     2
  489     2
  502     2
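Applied to the situation in the question, the same row-subsetting idea could be wrapped in a small helper. This is only a sketch: `filter_docs_by_term` is a hypothetical function (not part of tm), and the toy corpus below merely imitates the counts from the question, with a lower-case term "someterm" because tm lower-cases tokens by default.

```r
library(tm)

# Hypothetical helper (not part of tm): keep the documents in which
# `term` occurs at least `min_count` times.
filter_docs_by_term <- function(dtm, term, min_count = 1) {
  counts <- as.matrix(dtm[, term])[, 1]   # dense vector, one count per document
  dtm[as.vector(counts >= min_count), ]
}

# Toy data imitating the question: the term occurs 0, 0, 7, 22 and 0 times
n_hits <- c(0, 0, 7, 22, 0)
df <- data.frame(
  doc_id = paste0("doc", 1:5),
  text   = paste("filler", strrep("someterm ", n_hits)),
  stringsAsFactors = FALSE
)
dtm <- DocumentTermMatrix(VCorpus(DataframeSource(df)),
                          control = list(dictionary = c("someterm")))

dtm_hits <- filter_docs_by_term(dtm, "someterm", min_count = 1)
inspect(dtm_hits)
```

Converting a single column with as.matrix is cheap; converting the whole matrix of a large corpus would not be, which is why the helper only densifies the one term it needs.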

A tm document term matrix is a simple triplet matrix (class simple_triplet_matrix from the slam package), so the slam documentation helps in figuring out how to manipulate DTMs.
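As a rough illustration of what the triplet representation looks like, the sketch below filters documents by reading the `i`, `j` and `v` components directly. The toy corpus and the term "oil" are made up for the example.

```r
library(tm)

# Hypothetical toy corpus
df <- data.frame(
  doc_id = paste0("doc", 1:4),
  text   = c("oil prices rise", "markets fall", "oil oil oil", "prices fall"),
  stringsAsFactors = FALSE
)
dtm <- DocumentTermMatrix(VCorpus(DataframeSource(df)))

# A simple triplet matrix stores only the non-zero cells:
#   dtm$i = row (document) index, dtm$j = column (term) index, dtm$v = count
term_col <- match("oil", Terms(dtm))
hits     <- dtm$j == term_col & dtm$v >= 1
rows     <- sort(unique(dtm$i[hits]))

# Subset by row index, exactly as with an ordinary matrix
dtm_oil <- dtm[rows, ]
inspect(dtm_oil[, "oil"])
```

Because zero counts are never stored, this never touches the empty cells, which can matter for very large and very sparse matrices.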

Upvotes: 6
